Replay Attack Detection Using Integrated Glottal Excitation Based Group Delay Function and Cepstral Features

Abstract: The automatic speaker verification (ASV) system is susceptible to replay attacks. Recent literature has focused on score-level integration of multiple features, phase information-based features, high-frequency-based features, and glottal excitation for the detection of replay attacks. This work presents glottal excitation-based all-pole group delay function (GAPGDF) features for replay attack detection. The essence of a group delay function based on the all-pole model is to exploit information from the speech signal phase spectrum in an effective manner. Further, the performance of integrated high-frequency-based CQCC features with cepstral features, subband spectral centroid-based features (SCFC and SCMC), APGDF, and LPC-based features is evaluated on the ASVspoof 2017 version 2.0 database. On the development set, an EER of 3.08% is achieved, and on the evaluation set, an EER of 9.86% is achieved. The proposed GAPGDF features provide an EER of 10.5% on the evaluation set. Finally, integrated GAPGDF and GCQCC features provide an EER of 8.80% on the evaluation set. The computation time required for ASV systems based on various integrated features is compared to ensure symmetry between the integrated features and the classifier.


Introduction
Speaker recognition refers to recognizing a person from their voice. It involves speaker identification and speaker verification. Speaker identification identifies a person from a known set of voices, while speaker verification verifies the claimed identity of an individual. A recent challenge for automatic speaker verification (ASV) systems is the detection of spoofing attacks. Among the different spoofing attacks, replay attacks are especially challenging for ASV systems. Moreover, replay attacks are easy for attackers to mount, as they do not require any specific expertise. In recent years, the research community has made many efforts toward countermeasures and recognition of replay attacks. Major efforts to detect replayed signals were made in the ASVspoof 2017 challenge, held by a group of researchers during INTERSPEECH 2017 [1]. Afterwards, ASVspoof 2017 version 2.0 was made available with certain metadata modifications [2]. For the ASVspoof 2017 database, several phase- and magnitude-based features, along with fused combination strategies, have been suggested. CQCC, MFCC, IMFCC, SCMC, and SCFC are a few of the magnitude-based features proposed in [3]. A combined system of CQCC, Mel-RP, and PBSFVT features is presented in [4]. At INTERSPEECH 2018, frequency modulation features [5] and frequency-domain linear prediction features [6] were demonstrated. In [7], score-level fusion of CQCC and AWFCC is reported to improve replay attack detection. Score-level fusion of power function-based features and CQCC is proposed in [8].
Recently, many systems have been proposed for distinguishing replayed signals from genuine speech. Such systems fall into two types: those that focus on features and those that focus on classifiers [9]. One attribute of the taxonomy presented in [9] is the use of multiple features for improved performance; another is fusion, which can be performed at the score level or at the feature level [9]. Serial feature fusion of multiple features has been demonstrated by the current authors in [10]. Many ASV systems for replay attack detection are based on multiple features combined at the score level. These efforts are reviewed in the following section.

Related Work
Several approaches based on the use of phase information have been proposed. The effectiveness of relative phase information-based features, namely LPR-RP and LPAES-RP, is demonstrated in [11], where the authors report that LPR-RP and LPAES-RP features fused with CQCC at the score level are effective. The authors in [12] proposed MFCC features based on the linear prediction residual, which carries excitation source information (RMFCC); RMFCC and CQCC features fused at the score level yielded better performance in [12]. Hilbert envelope-based and residual phase features have also been fused at the score level [13].
The efficacy of instantaneous frequency-based features has been noted in recent work. Features based on instantaneous amplitude and instantaneous frequency are proposed in [14]; these features are extracted using the energy separation algorithm, and the resulting ESA-IFCC features are fused with CQCC at the score level [14]. Further approaches based on instantaneous frequency features are demonstrated in [15,16].
Score-level fusion of source, instantaneous frequency, and cepstral features is proposed in [15]. In [16], instantaneous frequency-based cochlear filter cepstral coefficients are proposed, and multiple features (CQCC, CFCC, CFCCIF, CFCCIF-ESA, and CFCCIF-QESA) are fused at the score level. The use of the multiple features AFCCsFAF, ARPDBF, and CQCC at the score level is proposed in [17]. Glottal MFCC and shifted CQCC-based features are fused at the score level in [18]. Teager energy-based features are fused at the score level with cepstral features in [19]. Autoencoder-reconstructed features are effective when fused at the score level with MFCCs, CQCCs, SCMCs, CCCs, LPCCs, IMFCCs, RFCCs, LFCCs, SCFCs, and the spectrogram, as demonstrated in [20]. The effectiveness of gammatone-scale relative phase combined with CQCC at the score level is presented in [21].
Recent studies have also proposed enhancements of existing feature extraction approaches. Three concatenated features based on the constant Q-transform, CQSPIC, CQEPIC, and CESPIC, are proposed in [22]. In [23], improved ETECC features are proposed for replay attack detection. Glottal information-based CQCC features, emphasizing the importance of the high-frequency band, are proposed in [24], where the 7-8 kHz band is considered. The importance of high-frequency information in the computation of CQCCs is demonstrated in [25,26]. Replay attack detection using 2D-ILRCC features, an enhancement of source-based RAD features, is proposed in [27].
From this extensive literature survey, it is observed that the use of multiple features with score-level fusion is widely practiced. Some approaches have focused on sub-band analysis and on enhancing baseline feature extraction schemes, and recent approaches have focused on glottal information-based features extracted using iterative adaptive inverse filtering [18,24]. As noted in [9], fusion can enhance replay attack detection performance, but it also adds fusion computation tasks, increasing computation time. The motivation for this work is to analyze the trade-off between computation time and detection rate. This work examines the score-level fusion approach while considering computation time, and reports the computation time for the systems that performed best in terms of detection rate. Symmetry between the integrated features and the classifier is necessary to keep computation time manageable.
This work evaluates the performance of an ASV system by integrating cepstral features and linear prediction-based features at the score level. Unlike the approach in [10], score-level fusion is evaluated by integrating different systems based on cepstral and LPC-based features. In this work, CQCC features are evaluated over the high-frequency band (6-8 kHz), and score-level integration is evaluated considering cepstral and linear prediction-based features. IMFCC, SCMC, SCFC, RFCC [3], and APGDF [28] features are also evaluated. The performance of IMFCC, SCMC, SCFC, and RFCC features has already been evaluated in [3]; however, all-pole group delay function-based features remain less explored for replay attack detection. This work proposes two approaches: the first is based on GCQCC with cepstral mean and variance normalization (CMVN), and the second on glottal information-based APGDF (GAPGDF) features with CMVN.
The structure of this paper is as follows. Section 1 presented the introduction. Section 2 presents the related work, followed by the methods in Section 3. The experimentation and outcomes are covered in Section 4. Section 5 provides a summary and conclusion of the paper.

Glottal Flow Derivative
In signal processing terms, speech is thought of as the time-varying vocal tract system's response to a stream of airflow passing through the glottis, the slit-like opening between the vocal folds [18,24,29]. Airflow through the glottis, or glottal flow, excites the vocal tract system. Lip radiation, corresponding to a first-order differentiation, further filters the vocal tract system's output [18,24,29]. Due to this differentiation property of the lips, the excitation observed in uttered speech takes the form of the glottal flow derivative (GFD). Thus, the derivative of the glottal flow is frequently regarded as the representation of the excitation signal in speech processing tasks [18,24,29]. Glottal flow and its derivative waveforms can be referred to in [18,24,29]. The full glottal cycle is made up of three distinct phases: open, closed, and return.
The closed phase is the state in which there is no airflow and the vocal folds are completely closed. During the open phase, there is non-zero airflow and the vocal folds are fully or partially open. The return phase is the interval between the instant of the glottal flow derivative's most negative value and the glottal closure time. A sequence of glottal cycles forms the excitation. Consequently, the glottal flow derivative reflects excitation information at three levels: within, between, and across glottal cycles. Information within the glottal cycle comprises timed events, such as the glottal flow duration and the instants of closure and opening [18,24,29]. Between subsequent glottal cycles, information about the average pitch and epoch strengths is reflected. High-level information, such as prosody and intonation, can be detected across many glottal cycles [18,24,29]. As mentioned in [18,24], the numerous devices and components employed in a replay configuration have an impact on all of this information, and these disturbances may alter the amplitude, frequency, and shape of the GFD signal [18,24]. It can therefore be expected that the GFD signals of replayed and genuine voice samples may be helpful for detecting replay signals [18,24].
In the first iteration of IAIF, an estimate of the glottal contribution is obtained by computing a first-order LPC model. In the second iteration, a higher-order LPC model is computed to produce a more accurate model of the glottal contribution. The details of the IAIF method can be found in [24]. To examine how the replay mechanism affects the excitation source information, the temporal and spectral representations of replayed and genuine signals are analyzed. Figure 1 displays the temporal and spectral representations of the GFD signal computed from a genuine (T_1000003.wav) and replayed (T_1001511.wav) speech pair from the ASVspoof 2017 version 2.0 database. From Figure 1a, the replayed/spoofed speech signal is distorted in amplitude and periodicity in the temporal representation; its spectral representation is also more distorted. Figure 2 shows the spectrograms of genuine (Figure 2a) and spoofed (Figure 2b) speech signals after processing with the IAIF method. It is evident from the spectrograms that the spoofed speech signal loses periodicity (marked by the rectangle) and has a lower concentration of speech energy (marked by the oval) compared to the genuine speech signal. These differences may help to distinguish a genuine signal from a spoofed one.
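The two-iteration structure described above can be sketched in Python. This is a deliberately simplified illustration, not the full IAIF algorithm of [24] (which also includes integration, windowing, and refinement steps); the model orders are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """LPC coefficients by the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))   # A(z) = 1 - sum_k a(k) z^-k

def iaif_sketch(frame, p_glottal=1, p_tract=20):
    """Very simplified two-iteration IAIF-style estimate of the
    glottal flow derivative (GFD) from one speech frame."""
    # Iteration 1: first-order LPC models the glottal contribution
    g1 = lpc(frame, p_glottal)
    tract_input = lfilter(g1, [1.0], frame)   # inverse-filter glottal tilt
    # Iteration 2: higher-order LPC models the vocal tract
    a_tract = lpc(tract_input, p_tract)
    return lfilter(a_tract, [1.0], frame)     # inverse-filter -> GFD estimate
```

In practice the COVAREP IAIF implementation referenced later in this paper should be preferred; the sketch only shows where the two LPC models enter the pipeline.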

Cepstral Mean and Variance Normalization
Robustness is a crucial factor in assessing how well replay detection methods work [39]. As mentioned in [39], when a spoof/replay detection method trained on one dataset is applied to sounds from another dataset, the mismatch between the two datasets usually causes a decline in the detector's performance, because of differing background and channel noises. CMVN is commonly used in replay attack detection approaches; examples include [12,17-19,23,40]. "CMVN eliminates the convolution noise in the temporal domain, such as channel distortion, and the channel noise in the cepstral domain, which corresponds to the additive deviation of the cepstral domain" [39]. Using CMVN, each training and test sample is converted to zero mean and unit variance [39,41].
Let y_t be the N-dimensional cepstral feature vector at time t, and let y_t(i) denote the i-th component of y_t. Then y = {y_1, y_2, ..., y_t, ..., y_T} represents a voice segment of length T. "CMVN first calculates the mean µ and the variance σ² using the maximum likelihood estimate for each feature dimension" [39]:

µ(i) = (1/T) Σ_{t=1}^{T} y_t(i),    σ²(i) = (1/T) Σ_{t=1}^{T} (y_t(i) − µ(i))².

Each feature dimension is then normalized as

ŷ_t(i) = (y_t(i) − µ(i)) / σ(i).
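A minimal per-utterance CMVN matching the normalization above might look like this in Python (the paper uses the MSR Identity Toolbox implementation; this is only an illustrative sketch, with a small eps added to guard against zero variance):

```python
import numpy as np

def cmvn(features, eps=1e-12):
    """Per-utterance cepstral mean and variance normalization.
    features: (T, N) array of N-dimensional cepstral vectors."""
    mu = features.mean(axis=0)        # maximum-likelihood mean per dimension
    sigma = features.std(axis=0)      # ML standard deviation per dimension
    return (features - mu) / (sigma + eps)
```

After normalization, each feature dimension of the utterance has zero mean and (approximately) unit variance, which is exactly the matching property exploited for cross-dataset robustness.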

Group Delay Function
The group delay function is defined as the negative derivative of the phase spectrum [42]. If x(n) is a frame of speech, its Fourier transform in polar form is

X(ω) = |X(ω)| e^{jθ(ω)},

where |X(ω)| is the magnitude spectrum and θ(ω) is the phase spectrum. The group delay function is the negative derivative of the continuous phase function,

τ(ω) = −dθ(ω)/dω.
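Numerically, the negative phase derivative can be approximated by finite differences of the unwrapped FFT phase. A Python sketch (the FFT size is illustrative):

```python
import numpy as np

def phase_group_delay(frame, nfft=512):
    """Group delay as the negative finite-difference derivative of the
    unwrapped phase spectrum of one speech frame."""
    X = np.fft.rfft(frame, nfft)
    theta = np.unwrap(np.angle(X))   # continuous phase function
    dw = 2 * np.pi / nfft            # frequency-bin spacing (rad/sample)
    return -np.diff(theta) / dw      # tau(w) = -d theta / d w
```

As a sanity check, a pure delay x(n) = δ(n − d) has phase −ωd, so this function returns a flat group delay of d samples.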
Figure 3 shows the group delay spectra of a genuine (Figure 3a) and a spoofed (Figure 3b) signal. Closely spaced higher formants are visible for both genuine and spoofed speech. It is evident from Figure 3 that peaks are more dominant in the spectrum of the spoofed signal. This might be due to channel noise produced by the recording device being added to, or integrated into, the speech signal.

Group Delay Function of All-Pole Models
In linear prediction analysis, the speech spectrum is approximated using an all-pole model [42,43]. "An equivalent representation of the vocal tract as a cascade of several second-order and first-order all-pole filters is possible when it is thought of as an all-pole filter" [42,44]. In this representation, the overall magnitude spectrum is the product of the magnitude spectra of the individual filters, while the individual phase spectra add together to form the overall phase spectrum, which in turn yields the group delay spectrum [42].
The linear prediction problem can be stated as follows: for the power spectrum |X(ω)|², find the coefficient set a(k) such that the speech power spectrum and the power spectrum of

H(ω) = G / (1 − Σ_{k=1}^{p} a(k) e^{−jωk})

match in a least-squares sense [42,43]. Here, G is a signal-dependent gain and p is the model order. The filter H(ω) has both a magnitude response and a phase response; the group delay function of this filter is known as the all-pole group delay function. "Closely spaced higher formants can be captured because of the group delay function's high-resolution characteristic" [42]. In the group delay spectrum, the higher-order formants are more noticeable, especially under low and high vocal effort conditions [42]. As mentioned in [42,45], the APGDF is converted into cepstral coefficients by applying the DCT. Figure 4 shows the feature extraction process.
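Putting these steps together, an APGDF-style feature extractor for one frame might be sketched as follows. The LPC order, FFT size, and number of retained coefficients are illustrative, not the paper's tuned settings (those are explored later via the frame-length/LPC-order sweeps):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import group_delay
from scipy.fft import dct

def apgdf(frame, order=20, nfft=512, n_ceps=12):
    """All-pole group delay features: fit an LPC model to the frame,
    take the group delay of the all-pole filter H(z) = G/A(z),
    then decorrelate with a DCT."""
    # Autocorrelation-method LPC (guarantees a stable, minimum-phase A(z))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    A = np.concatenate(([1.0], -a))                # denominator A(z)
    # Group delay of the all-pole model; the gain G does not affect phase
    _, gd = group_delay(([1.0], A), w=nfft)
    return dct(gd, type=2, norm="ortho")[:n_ceps]  # cepstral-like coefficients
```

The DCT at the end mirrors the cepstral conversion mentioned in [42,45]; in the full system these frame-level vectors are stacked over time before GMM modelling.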

Classifier
GMM is still the most widely utilized classifier [9]; a GMM baseline (CQCC + GMM) is also provided by the ASVspoof 2017 challenge. This work employs a GMM classifier to discriminate between genuine and spoofed speech. One 512-component GMM is trained for genuine utterances and another for spoofed utterances, using the Expectation-Maximization (EM) algorithm with random initialization. The score for a test utterance is the log-likelihood ratio computed against the genuine and spoofed speech models. An implementation of the GMM is available in VLFeat [46].
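The paper uses VLFeat's GMM in MATLAB; the same two-model, log-likelihood-ratio scheme can be sketched with scikit-learn's `GaussianMixture` (the paper uses 512 components; `n_components` is left as a parameter so the sketch runs quickly on small data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(genuine_feats, spoof_feats, n_components=512):
    """Fit one GMM per class with EM and random initialization."""
    gmm_gen = GaussianMixture(n_components, covariance_type="diag",
                              init_params="random", random_state=0).fit(genuine_feats)
    gmm_spf = GaussianMixture(n_components, covariance_type="diag",
                              init_params="random", random_state=0).fit(spoof_feats)
    return gmm_gen, gmm_spf

def llr_score(test_feats, gmm_gen, gmm_spf):
    """Average per-frame log-likelihood ratio; positive -> closer to genuine."""
    return gmm_gen.score(test_feats) - gmm_spf.score(test_feats)
```

`score` returns the average log-likelihood per frame, so the difference is the per-frame log-likelihood ratio used as the utterance score.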

Evaluation Metric
The ASVspoof 2017 challenge [1] provided the Equal Error Rate (EER), computed using the Bosaris toolkit [47], as the evaluation metric for its baseline systems. In recent years, the EER has been the main criterion for evaluating replay attack detection systems [9]. In this work, the EER is used as the evaluation metric for system performance. More details on the EER can be found in [18].
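The EER is the operating point where the false-acceptance rate (spoof scored as genuine) equals the false-rejection rate (genuine scored as spoof). A simple threshold-sweep estimate, standing in for the Bosaris computation the paper uses:

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    """Equal Error Rate: sweep candidate thresholds and return the
    point where false-acceptance and false-rejection rates cross."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))   # closest crossing of the two curves
    return (far[idx] + frr[idx]) / 2
```

Bosaris uses a ROC convex-hull (ROCCH) interpolation, so its EER can differ slightly from this discrete sweep on small score sets.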

Database
The ASVspoof 2017 version 2.0 database [2] is used for the experimentation. This database contains genuine utterances from the RedDots corpus [2]. The spoofed utterances result from replaying and re-recording the bona fide utterances using a variety of heterogeneous devices and acoustic environments [2]. The database has three non-overlapping sections: the training, development, and evaluation subsets. More information about the database is available in [2]. Table 1 describes the database.

Experiment Setup
The experiments are conducted using MATLAB R2021a on the Windows 11 operating system, running on a Lenovo Legion laptop with a 10th-generation Intel Core i7 processor.

Baseline System
The current authors have evaluated the baseline CQCC [2] and LFCC [48] methods in [10], along with the MFCC, LPC, and LPCC methods. The results of that work are shown in Table 2. In all experiments, DevEER (%) denotes the equal error rate achieved on the development set and EvalEER (%) the equal error rate achieved on the evaluation set. In [10], the current authors integrated cepstral and LPC-based features in various combinations at the serial level; this work integrates cepstral and LPC-based features at the score level.

Score-Level Integration of Cepstral and LPC Based Features
This section presents the results of integrating cepstral and LPC-based features at the score level. The parameters considered for the cepstral and LPC-based features are the same as those mentioned in [10], as shown in Table 3. The MFCC, LPC, and LPCC implementations are taken from the Voicebox toolbox [49]. In addition to the cepstral and LPC-based features, a timbral feature set formed from various features was demonstrated in [10]; this work explores the efficacy of these features under score-level integration. A 90-dimensional zero-crossing rate (ZCR) feature set is also evaluated along with the timbral feature set. The timbral feature set includes brightness [50], entropy, event density, flatness, inharmonicity, kurtosis, pitch, irregularity [50], rolloff [50], RMS, skewness, and spread. These features are extracted using the MIR toolbox [51]. The basic idea of score-level integration is to combine the scores computed from different systems: for example, CQCC + GMM is one system, MFCC + GMM is another, and so on. Table 4 presents the results of the experiments carried out by integrating the cepstral-domain, LPC, zero-crossing, and timbral feature sets at the score level. From Table 4, integration of the CQCC, LFCC, MFCC, LPC, and LPCC features achieves an EER of 5.44% on the development set, and integration of the ZCR, MFCC, CQCC, LFCC, and LPC features achieves an EER of 18.33% on the evaluation set. It is evident that the cepstral and LPC-based features are more prominent than the timbral features when integrated at the score level.
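Mechanically, score-level integration reduces to combining one score per trial from each subsystem. A common recipe, sketched below, is to z-normalize each subsystem's scores over the trial list and average them; note that the specific normalization and (equal) weights here are assumptions for illustration, as the paper does not spell out its fusion rule:

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Score-level integration of several subsystems.
    score_lists: list of 1-D arrays, one score per trial per subsystem."""
    z = [(s - s.mean()) / s.std()            # z-norm each subsystem's scores
         for s in map(np.asarray, score_lists)]
    return np.average(np.vstack(z), axis=0, weights=weights)
```

Each subsystem (e.g., CQCC + GMM, MFCC + GMM) contributes one array of log-likelihood-ratio scores; the fused array is then evaluated with the EER.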

High Frequency Band
In the process of spoofed signal generation, the original speech is degraded by the impulse responses of the microphone and playback device as well as by the acoustic properties of the environment [26]. This hypothesis is exploited in [26] to discriminate replayed speech from genuine speech using spectral features derived from several frequency sub-bands. The work presented in [24-26] has focused on information present in the high-frequency band, and the 6-8 kHz band is shown to be particularly useful for replay attack detection; CQCC features are therefore extracted over the 6-8 kHz frequency band [24-26]. This work evaluates the performance of CQCC features over this band, and also integrates various cepstral and LPC-based features with the high-frequency-based CQCC features. Table 5 presents the parameterization used for extracting the high-frequency-based CQCC features; this parameterization has been demonstrated in [52]. As observed in Table 6, the high-frequency-based CQCC features perform better than the baseline CQCC features. The integrated CQCC, LFCC, MFCC, LPC, and LPCC features achieve an EER of 4.81% on the development set and 13.45% on the evaluation set. From these results, it is observed that integrating high-frequency-based CQCC features with cepstral and LPC-based features performs better under score-level fusion.

APGDF, IMFCC, RFCC, SCFC and SCMC Features
Spectral sub-band centroid-based features (SCMC and SCFC), inverted MFCC, and RFCC features are evaluated for replay attack detection in [3]. In [28], APGDF features are examined for synthetic speech detection. This work evaluates these features to examine their effectiveness. Table 7 shows the parameters considered for each feature; implementations of the features in Table 7 are available in [28,53,54]. Figure 6 shows the experimental results for the APGDF, IMFCC, RFCC, SCFC, and SCMC features. As observed from Figure 6, although the results are less prominent, these features can provide complementary information, so they are integrated with the cepstral-domain and LPC-domain features; indeed, the results achieved on the development set are more prominent than those of the baseline methods. Since the group delay function based on the all-pole model resolves closely spaced higher formants [42], this work uses APGDF features for replay attack detection.

Score Level Integration of All Features
It is evident that the high-frequency-based CQCC features perform better than the baseline CQCC. Moreover, integrating the high-frequency-based CQCC, LFCC, MFCC, and LPC-based features (LPC and LPCC) improves performance on both the development and evaluation sets, confirming that score-level integration of features is effective. To obtain further complementary information, the sub-band centroid (SCMC and SCFC), inverted MFCC, RFCC, and APGDF features are also integrated. Table 8 shows the results of this integration. As observed in Table 8, an EER of 3.08% is attained on the development set for the high-frequency-based CQCC features integrated with the cepstral, LPC-based, IMFCC, RFCC, APGDF, SCMC, and SCFC features; on the evaluation set, an EER of 9.86% is achieved for the same integrated features.

Glottal Excitation Based CQCC Features with CMVN
Next, this work evaluates the performance of glottal excitation-based CQCC features. These features were evaluated in [24]; here, however, the GCQCC features are evaluated with the CMVN technique. The CMVN implementation is taken from the MSR Identity Toolbox [55], and for glottal excitation computation the IAIF method from the COVAREP repository [56], version 1.4.2, is used. The CQCC parameters are the same as for the baseline CQCC. Additionally, the performance of the GCQCC features is evaluated by varying the number of cepstral coefficients, including static, delta, and double-delta coefficients. Figure 7 shows the results of GCQCC with the CMVN technique. As shown in Figure 7, 120 cepstral coefficients (40 static + 40 Δ + 40 ΔΔ) give the best performance on the development and evaluation sets. Hence, in the subsequent experiments, GCQCC with CMVN is used with 120 cepstral coefficients.
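The delta and double-delta coefficients appended to the 40 static coefficients are typically computed with the standard regression formula over a few neighboring frames. A sketch (for brevity the edge frames wrap around here, whereas toolkit implementations usually pad by repeating the boundary frames):

```python
import numpy as np

def add_deltas(c, width=2):
    """Append delta and double-delta coefficients to static cepstra.
    c: (T, n) static coefficients -> (T, 3n) feature matrix."""
    def delta(x):
        # regression over +/- width frames: sum_k k*(x[t+k]-x[t-k]) / (2*sum k^2)
        num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
                  for k in range(1, width + 1))
        return num / (2 * sum(k * k for k in range(1, width + 1)))
    d = delta(c)
    return np.hstack([c, d, delta(d)])
```

Applied to 40 static GCQCC coefficients this yields the 120-dimensional (40 + 40 Δ + 40 ΔΔ) vectors used in the subsequent experiments.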

All Pole Group Delay Function Features (APGDF) with CMVN
The APGDF features with the CMVN technique are evaluated on the development set by varying the frame length and the LPC order. The work presented in [57,58] has shown that frame lengths beyond 20 milliseconds are useful for speaker recognition; this work therefore evaluates different frame lengths that may be useful for replay attack detection. Figures 8-11 show the % EER computed for APGDF with the CMVN technique by varying the LPC order with frame lengths of 20, 30, 40, and 50 msec, respectively.

Glottal Excitation Based APGDF Features with CMVN
IAIF-based APGDF features with the CMVN technique are evaluated on the evaluation set. For the evaluation set, experiments are carried out with frame lengths of 40 and 50 milliseconds while varying the LPC order. Figures 12 and 13 show the % EER computed for GAPGDF with the CMVN technique by varying the LPC order with frame lengths of 40 and 50 msec, respectively. From Figures 12 and 13, it is observed that an EER of 10.5% is achieved on the evaluation set for a frame length of 50 milliseconds with LPC order 80. Hence, for the evaluation set, a frame length of 50 milliseconds and LPC order 80 are considered.
The proposed IAIF-based APGDF (GAPGDF) features with the CMVN technique are integrated with the IAIF-based CQCC (GCQCC) features with the CMVN technique. The best-performing configurations are used when integrating the features, as listed in Table 9.
The APGDF features perform best on the development set, while the GAPGDF features show better performance on the evaluation set. Integrated APGDF and GCQCC features result in an EER of 5.48% on the development set, and an EER of 8.80% is obtained for integrated GAPGDF and GCQCC features on the evaluation set. These results show that the GAPGDF features are effective for replay attack detection.

Computation Time
This section describes the computation time required to execute systems based on the score-level integration of different speech features. From the results presented in the preceding sections, it is evident that different speech features integrated at the score level improve the performance of the ASV system against replay attacks. A particular system represents an algorithm: features are extracted in the training phase from genuine and spoofed signals, followed by GMM modelling; in the testing phase, features extracted from the development or evaluation data are scored using the GMM models trained earlier. The scores of multiple such systems, based on various features, are computed, and each system requires a certain amount of time for complete execution. The time taken by the algorithm implementing each system is reported: for example, system 1 may be the score computed using CQCC features, system 2 the score computed using MFCC features, and so on. Table 10 shows the computation time required to compute the scores for the fused systems; the serial feature fusion approach presented in [10] is also included for comparison. Computation time is measured using the tic and toc functions available in MATLAB. From Table 10, it is evident that systems based on the serial integration of speech features require more computation time: as the feature vector size increases, the GMM requires more time to train the model. Score-level integration, in contrast, requires less execution time than serial fusion, because each system is trained separately on a low-dimensional feature vector and the scores computed from each system are then integrated. However, more time is required as more systems are trained to compute scores. The proposed system based on the integrated GAPGDF and GCQCC features requires less time than the other approaches.
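The tic/toc measurement pattern translates directly to other environments; for instance, a Python equivalent of the wall-clock timing used here (function and name are illustrative):

```python
import time

def timed(fn, *args):
    """MATLAB tic/toc analog: run fn(*args) and report wall-clock seconds."""
    t0 = time.perf_counter()      # tic
    out = fn(*args)
    return out, time.perf_counter() - t0   # toc
```

Wall-clock timing like this includes I/O and OS scheduling noise, which is why the comparisons in Table 10 are best read as relative orderings rather than exact costs.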

Computation Complexity
An algorithm's computational complexity, i.e., the number of steps it takes to finish, is approximated as a function of the input size in Table 11. Table 11 shows the computational complexity approximated for the proposed approach, i.e., the integrated GAPGDF and GCQCC features. The computational complexity increases as more features are integrated. In terms of computational complexity, computation time, and achieved results, the proposed integrated features perform comparably better than integrating many features.

Performance Comparison between the Proposed Algorithm and Recent Approaches
Table 12 shows a performance comparison between the proposed algorithm and recent approaches on the development and evaluation sets of ASVspoof 2017 version 2.0. From Table 12, an EER of 3.08% on the development set and 9.86% on the evaluation set is achieved by integrating the cepstral features, LPC-based features, and the APGDF, RFCC, SCMC, SCFC, and IMFCC features; however, more computation time is required for such integration. The proposed system based on the integration of APGDF and GCQCC features results in an EER of 5.48% on the development set and 12.44% on the evaluation set. Better results are achieved for the proposed system based on the integration of GAPGDF and GCQCC features, with an EER of 8.80% on the evaluation set. The proposed GAPGDF features are prominent in terms of both results and computation time.

Conclusions
This work evaluated the performance of features based on the cepstral domain, the LPC domain, and sub-band spectral centroids, integrated at the score level. The experiments were conducted with the goal of lowering the EER, which indicates a better defense against replay attacks. The results achieved for the high-frequency-based CQCC features (HF-CQCC), using the 6-8 kHz frequency band, are promising compared to the baseline CQCC and other cepstral-domain features. Using integrated features based on HF-CQCC, the cepstral domain, the LPC domain, and sub-band spectral centroids, an EER of 3.08% is achieved on the development set and 9.86% on the evaluation set. The glottal excitation-based CQCC (GCQCC) features also perform better than the baseline CQCC and HF-CQCC. This work proposed all-pole group delay function-based features extracted from the glottal excitation (GAPGDF). The APGDF features demonstrated superior performance on the development set with an EER of 7.49%, and on the evaluation set the GAPGDF features performed better, with an EER of 10.5%, compared to the baseline methods and the GCQCC and HF-CQCC features. An EER of 5.48% was obtained on the development set when the APGDF features were integrated with GCQCC, and 8.80% on the evaluation set when the GAPGDF features were integrated with GCQCC. This shows that APGDF-based features are promising for improving ASV performance against replay attacks. Although integrating multiple features based on the cepstral domain, LPC domain, and sub-band spectral centroids yields better performance on the development set, it requires more computation time than the integrated GAPGDF and GCQCC features, especially on the evaluation set, whose database is larger than the development set. It is concluded that multiple features can be integrated at the score level for improved performance, but such integration increases computation time. Therefore, the extracted features should be efficient and effective, so that the required computation time is as low as possible while performance improves. The integrated GAPGDF and GCQCC features are computationally effective considering the results achieved on the evaluation set and the computation time required. An advanced version of the algorithm may include experimentation on a real-time database; future research will concentrate on integrating glottal excitation-based APGDF features with high-frequency glottal excitation-based CQCC features on such a database.

Figure 1. Temporal (a) and spectral (b) representations of the IAIF-based genuine and spoof signals, respectively.

Figure 2. Spectrograms of genuine (a) and spoof (b) speech signals after processing with the IAIF method.

Figure 3. Group delay spectra of the genuine (a) and spoof (b) signals.

Proposed IAIF-Based APGDF Feature Extraction
The block diagram of the training-phase and testing-phase framework is shown in Figure 5. The training phase involves IAIF-based APGDF feature extraction from genuine and spoofed speech signals, followed by a Gaussian mixture model for each class. In the testing phase, IAIF-based APGDF features are extracted from the test samples in the development and evaluation sets. Scores are computed using the log-likelihood ratio, which is then used to compute the Equal Error Rate (EER). In this work, the IAIF-based APGDF features are named GAPGDF.

Figure 5. Block diagram of the training and testing phases for the proposed work.

Figure 7. % EER obtained by varying the number of cepstral coefficients for GCQCC with the CMVN technique.

Figure 8. % EER computed for the APGDF technique by varying the LPC order with a frame length of 20 msec.

Figure 9. % EER computed for the APGDF technique by varying the LPC order with a frame length of 30 msec.

Figure 10. % EER computed for the APGDF technique by varying the LPC order with a frame length of 40 msec.

Figure 11. % EER computed for the APGDF technique by varying the LPC order with a frame length of 50 msec.
From Figures 8-11, it is observed that an EER of 7.49% is achieved on the development set for a frame length of 40 milliseconds with LPC order 40. Hence, for the evaluation set, a frame length of 40 milliseconds and LPC order 40 are considered.

Figure 12. % EER computed for the GAPGDF technique by varying the LPC order with a frame length of 40 msec.

Figure 13. % EER computed for the GAPGDF technique by varying the LPC order with a frame length of 50 msec.

Table 3. Parameters for the respective features.

Table 4. Results of experiments carried out by integrating the cepstral-domain, LPC, zero-crossing, and timbral feature sets at the score level.

Table 5. Parameters used for the high-frequency-based CQCC features.

Table 6. Results of the experiments carried out by integrating high-frequency-based CQCC features with cepstral and LPC-based features.

Table 7. Features with their parameters.

Table 8. Performance evaluation by integrating high-frequency-based CQCC features with cepstral, LPC-based, spectral sub-band centroid, and APGDF features.

Table 9. Score-level integration of GAPGDF and GCQCC features, and of APGDF and GCQCC features.

Table 10. Computation time required for score computation considering the fused systems.

Table 11. Approximated computational complexity for the techniques.

Table 12. Performance comparison between the proposed algorithm and recent techniques.