Improving Voice Spoofing Detection Through Extensive Analysis of Multicepstral Feature Reduction †
Abstract
1. Introduction
- The incorporation of noncepstral features to enrich the descriptive power of the voice signal representation;
- A broader experimental analysis comprising additional dimensionality reduction techniques and a more diverse set of configurations;
- Extensive validation on the ASVSpoof 2017 v2.0 benchmark dataset, demonstrating the effectiveness of the proposed techniques under various experimental conditions.
2. Related Works
3. Cepstral Feature Extraction Fundamentals
- (i) A pre-emphasis filter to compensate for high-frequency attenuation;
- (ii) Segmentation of the signal into short overlapping frames;
- (iii) Application of window functions;
- (iv) Transformation into the frequency domain, usually via Fast Fourier Transform (FFT);
- (v) Mapping onto perceptual or linear filter banks.

The final step involves the computation of cepstral coefficients from the resulting spectral envelopes.
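The five steps above, together with the final cepstral computation, can be sketched end to end. The following NumPy/SciPy sketch is illustrative only: a mel filter bank and DCT stand in for the various filter-bank and cepstrum choices discussed later, and the frame sizes and 0.97 pre-emphasis coefficient are conventional assumptions, not values taken from this paper.

```python
import numpy as np
from scipy.fftpack import dct

def simple_cepstral(signal, sr=16000, frame_len=400, hop=160,
                    n_filters=26, n_ceps=20):
    """Illustrative cepstral pipeline: pre-emphasis, framing,
    windowing, FFT, mel filter bank, DCT-based cepstra."""
    # (i) pre-emphasis to compensate for high-frequency attenuation
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (ii) segmentation into short overlapping frames
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx]
    # (iii) window function (Hamming)
    frames = frames * np.hamming(frame_len)
    # (iv) frequency domain via FFT (power spectrum)
    nfft = 512
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # (v) perceptual (mel) triangular filter bank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    fb_energy = np.log(spec @ fbank.T + 1e-10)
    # final step: cepstral coefficients from the spectral envelopes
    return dct(fb_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The result is a T × n matrix (one row per frame), which is the form assumed by the projection strategies described later.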
4. Materials and Methods
- The use of data augmentation strategies to double the size of the training set through additive noise;
- The integration of noncepstral acoustic features to complement the cepstral representations;
- The systematic experimentation with dimensionality reduction techniques applied to the final feature vector to optimize the compactness and discriminative capacity of the representation.
- Generation of Data Augmentation: In order to increase the robustness of the feature representations and improve generalization, additive Gaussian noise is applied to each original voice signal. For each sample x, a new signal x̃ = x + η is generated, where η ∼ N(0, σ²), effectively doubling the dataset size. The noise variance is calibrated to preserve intelligibility while introducing variability.
- CC Extraction: For each signal x (original and augmented), compute the matrix C of cepstral features.
- Differential Calculation: Compute the first- and second-order temporal derivatives ΔC and ΔΔC.
- Projection of Features: Apply one or more projection techniques from the candidate set (e.g., PCA, MEAN, STD, SKEW, or combinations). This step is the object of systematic experimentation, and the projected features are generated using each strategy in isolation or in combination, as defined in Equation (3).
- Fusion of Cepstral Representations: Concatenate all intermediate projected vectors into a single feature vector representing all selected cepstral techniques.
- Fusion with Noncepstral Features: Concatenate the cepstral feature vector with the noncepstral feature vector, forming the final representation, according to Equation (6).
- Normalization of Data: Standardize the fused vector using z-score normalization across the training set, x̂ = (x − μ)/σ, computed per dimension, ensuring zero mean and unit variance for each dimension.
- Final Dimensionality Reduction: Apply a final dimensionality reduction technique to the normalized vector, producing a more compact version. This step is crucial for evaluating the trade-off between dimensionality and model performance and constitutes the core objective of this study.
- Training of the Classifier: The final reduced vector serves as the input to a supervised learning model, trained to discriminate between bona fide and spoofed voice signals. The classifier is fitted using labeled training data {(x_i, y_i)}, where y_i denotes the ground-truth label. Different classification models, such as Support Vector Machines, Random Forests, and Logistic Regression, are explored under identical conditions to assess their compatibility and performance in conjunction with the dimensionality reduction strategies.
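As a concrete illustration of the augmentation and normalization steps in the pipeline above, the sketch below doubles a dataset with additive Gaussian noise and fits z-score statistics on the training set only. The noise level sigma is a free parameter here, whereas the paper calibrates the variance to preserve intelligibility.

```python
import numpy as np

def augment_with_noise(signals, sigma=0.005, seed=0):
    """Double the dataset: for each x, also keep x + eta, eta ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    noisy = [x + rng.normal(0.0, sigma, size=x.shape) for x in signals]
    return signals + noisy  # originals followed by their noisy copies

def zscore_fit_transform(X_train):
    """Z-score normalization: zero mean, unit variance per dimension,
    with statistics estimated on the training set only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12  # guard against constant features
    return (X_train - mu) / sd, (mu, sd)
```

At test time the stored (mu, sd) pair is reused, so no statistics leak from evaluation data into the normalization.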
5. Parameters for the Proposed Method and Practical Instances
- Mapping functions: In the context of this study, the term mapping function refers to a transformation applied to each matrix of cepstral features extracted from the voice signal. These matrices represent a sequence of T frames, where each row corresponds to one frame of cepstral coefficients. The purpose of the mapping is to summarize the temporal evolution of each coefficient into a fixed-size vector. This step is crucial for converting the temporal feature maps into a format compatible with static pattern recognition models. We define this projection step as the function used in Equation (3). For our experiments, we consider four fundamental mapping functions, defined as column-wise operations: (i) projection by PCA, which reduces the temporal dimension via orthogonal transformation; (ii) the column-wise mean, which summarizes the central tendency over time; (iii) the column-wise standard deviation, capturing temporal variability; and (iv) the column-wise skewness, reflecting asymmetry in the temporal distribution. These functions may be applied individually or combined in sequence to compose a richer projection strategy. In line with the multi-projection concept proposed in this work, we evaluate the sets of mapping functions listed in Table 2. Each configuration corresponds to a different strategy for projecting cepstral matrices into compact feature vectors. The goal is to compare the discriminative potential of different projections and their combinations in the context of voice spoofing detection.
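A minimal sketch of the four mapping functions as column-wise operations on a T × n cepstral matrix follows. The PCA variant here keeps the loading vector of the first principal direction of the frame sequence; the paper's exact PCA configuration may differ.

```python
import numpy as np
from scipy.stats import skew
from sklearn.decomposition import PCA

def project_mean(C):
    """Column-wise mean: central tendency of each coefficient over time."""
    return C.mean(axis=0)

def project_std(C):
    """Column-wise standard deviation: temporal variability."""
    return C.std(axis=0)

def project_skew(C):
    """Column-wise skewness: asymmetry of each coefficient's distribution."""
    return skew(C, axis=0)

def project_pca(C):
    """PCA over the temporal dimension: treat frames as samples and
    return the first principal direction (one value per coefficient)."""
    return PCA(n_components=1).fit(C).components_[0]

def multi_project(C, fns):
    """Apply a set of mapping functions and concatenate the results,
    turning a T x n matrix into one fixed-size vector."""
    return np.concatenate([f(C) for f in fns])
```

With n = 20 coefficients, applying all four functions yields an 80-dimensional vector per cepstral technique, regardless of the utterance length T.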
- Noncepstral features: To enrich the representation of the voice signal beyond cepstral information, we incorporate a comprehensive set of noncepstral measures. These features capture complementary acoustic, prosodic, and articulatory cues that may highlight inconsistencies or artifacts introduced by spoofing attacks. Based on the prior literature [39,40,41,42,43,44,45,46], we consider the following metrics:
  – Fundamental frequency statistics: Mean and standard deviation of the fundamental frequency F0, reflecting pitch dynamics over the utterance.
  – Harmonics-to-noise ratio (HNR): Indicates the ratio of periodic (harmonic) to aperiodic (noise) energy in the signal, often altered by synthesis or replay artifacts.
  – Jitter-based features:
    * Local jitter: Measures cycle-to-cycle variability in the pitch period.
    * Local absolute jitter: Captures absolute deviations in pitch period durations.
    * RAP jitter: Relative average perturbation across consecutive cycles.
    * Jitter PCA projection: Single projection summarizing all jitter-based features via PCA.
  – Shimmer-based features:
    * Local dB shimmer: Frame-level amplitude variation in dB.
    * APQ3/APQ5/APQ11 shimmer: Amplitude perturbation quotients calculated over 3, 5, and 11 adjacent cycles, respectively.
    * DDA shimmer: Mean absolute difference in amplitude across voice cycles.
    * Shimmer PCA projection: Single projection summarizing shimmer-related features.
  – Formant-based measures:
    * Mean and median of F1–F4: Descriptive statistics for the four primary formant frequencies.
    * Formant dispersion: One-third of the distance between the F1 and F4 medians.
    * Arithmetic mean of formants: Mean of the median values of F1–F4.
    * Formant position: Standardized mean of the F1–F4 medians.
    * Formant spacing (Δf): Minimum spacing between adjacent formants, estimated via linear regression.
    * VTL based on Δf: Vocal tract length calculated from the Δf spacing.
    * Mean formant frequency (MFF): Fourth root of the product of the medians of F1–F4.
    * Fitch virtual tract length (FVTL): Estimate of vocal tract length derived from spectral modeling.
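The formant-derived measures above can be computed directly from the F1–F4 medians. The sketch below follows one common formulation of Δf and VTL (a uniform-tube model in the style of Reby and McComb [45], with the speed of sound c = 350 m/s); these constants and the regression-through-the-origin form are assumptions for illustration, not necessarily the paper's exact estimator.

```python
import numpy as np

def formant_measures(f_med, c=350.0):
    """Descriptive formant measures from the medians of F1..F4 (Hz).
    Delta-f and VTL use a uniform-tube assumption: F_i = (2i-1)/2 * delta_f."""
    f_med = np.asarray(f_med, dtype=float)         # [F1, F2, F3, F4] medians
    dispersion = (f_med[3] - f_med[0]) / 3.0       # one-third of the F4-F1 span
    mean_formant = f_med.mean()                    # arithmetic mean of medians
    mff = np.prod(f_med) ** 0.25                   # fourth root of the product
    # formant spacing: regress F_i on (2i-1)/2 through the origin
    x = (2 * np.arange(1, 5) - 1) / 2.0
    delta_f = (x @ f_med) / (x @ x)
    vtl_cm = c / (2 * delta_f) * 100               # vocal tract length in cm
    return {"dispersion": dispersion, "mean": mean_formant,
            "mff": mff, "delta_f": delta_f, "vtl_cm": vtl_cm}
```

For an ideal uniform tube with formants at 500, 1500, 2500, and 3500 Hz, the regression recovers Δf = 1000 Hz and a vocal tract length of 17.5 cm.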
- Cepstral coefficient extraction techniques: In this study, we adopt the same diverse set of signal processing methods for extracting cepstral representations as Contreras et al. [8], which are widely used in speaker and spoofing detection tasks. The techniques considered include the following: Constant-Q Cepstral Coefficients (CQCCs), Mel-Frequency Cepstral Coefficients (MFCCs), their inverted variant (iMFCC), Linear Frequency Cepstral Coefficients (LFCCs) [20], Gammatone-based features (GFCCs) [47], Bark-scaled cepstral representations (BFCCs) [48], Linear Predictive Cepstral Coefficients (LPCCs) [49], and Normalized Gammachirp Cepstral Coefficients (NGCCs) [50]. For each technique, the extraction process generates a sequence of feature vectors composed of 20 static coefficients, along with their respective first- and second-order temporal derivatives (Δ and ΔΔ), resulting in a total of 60 dimensions per frame. All features undergo Cepstral Mean and Variance Normalization (CMVN), which standardizes the distribution of each coefficient across frames to zero mean and unit variance, reducing channel and session variability. These cepstral representations are evaluated both individually and in predefined combinations, as organized in Table 3. The goal is to investigate how different spectral scales and filterbank structures influence the discriminative power of the final feature vector when used in conjunction with projection and fusion strategies tailored for voice spoofing detection.

Due to text space limitations, it was not possible to consider all existing combinations of the eight CC extraction techniques; however, the most important combinations were analyzed to evaluate the performance of the proposed method. We also consider single-technique versions, in which the set of cepstral representations contains only one element. These versions allowed us to evaluate how much the framework enhances the capacity of each individual technique to detect spoofed voice signals.
In addition, the other versions of the CC extraction techniques serve to confirm the ability of the proposed method to represent different features of the sound sample, which should contribute to improving its ability to detect spoofed samples.
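The Δ/ΔΔ computation and CMVN described above can be sketched as follows. The ±2-frame regression window for the derivatives is a common convention, not necessarily the one used in the paper.

```python
import numpy as np

def deltas(C, w=2):
    """First-order temporal derivatives via the standard regression
    formula over a +/- w frame window (edge frames are replicated)."""
    T = len(C)
    pad = np.pad(C, ((w, w), (0, 0)), mode='edge')
    num = sum(k * (pad[w + k: w + k + T] - pad[w - k: w - k + T])
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

def cmvn(C, eps=1e-10):
    """Cepstral Mean and Variance Normalization: standardize each
    coefficient to zero mean and unit variance across frames."""
    mu = C.mean(axis=0, keepdims=True)
    sd = C.std(axis=0, keepdims=True)
    return (C - mu) / (sd + eps)

def full_features(C20):
    """20 static coefficients + delta + delta-delta -> T x 60, then CMVN."""
    d1 = deltas(C20)
    d2 = deltas(d1)
    return cmvn(np.hstack([C20, d1, d2]))
```

Applied to a T × 20 static matrix, this yields the T × 60 normalized representation that feeds the projection stage.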
- Normalization: To assess the impact of feature scaling on the final representation, we consider two alternative strategies: standard normalization and the absence of normalization. In the first case, each feature is standardized to have zero mean and unit variance across the training set, a procedure commonly used to ensure uniform scale and numerical stability. In the second case, no transformation is applied to the original values, preserving the raw scale of the features. Mathematically, the standardized vector is given by x̂ = (x − μ)/σ, where μ and σ denote the per-dimension mean and standard deviation estimated on the training set. These two strategies are compared experimentally in order to evaluate whether scaling influences the performance of spoofing detection when applied after fusion and before dimensionality reduction.
- Dimensionality reduction: As a central contribution of this work, we investigate how different dimensionality reduction strategies impact the discriminative power of the final fused vector in the context of spoofing detection. This reduction is applied after the concatenation and normalization of cepstral and noncepstral features, with the aim of eliminating redundancy, enhancing generalization capability, and improving the computational efficiency of the classification stage. A total of eight techniques were considered, covering both projection-based and selection-based approaches:
  – PCA: A classical linear projection method that transforms the original feature space into a set of orthogonal components ordered by their ability to capture data variance. The leading components are retained to represent the most informative directions in the data.
  – SVD: Similar in nature to PCA, SVD decomposes the data into singular vectors and values, retaining only the most significant components. Unlike PCA, it does not require centering the data and can be applied directly to sparse or high-dimensional matrices.
  – ANOVA F-value selection: A univariate statistical method that selects features by computing the ratio of between-class variance to within-class variance. Features with the highest discriminative power, as measured by the F-statistic, are selected.
  – Mutual Information (MI) ranking: This technique quantifies the amount of information shared between each feature and the target class. Features are ranked by their Mutual Information scores, and the most informative ones are retained.
  – Recursive Feature Elimination (RFE): A wrapper method that iteratively trains a predictive model and removes the least important features at each step. This process continues until the desired number of features is retained, prioritizing those most influential to the model’s performance.
  – LASSO-based selection: A regularization-based approach that performs feature selection and regression simultaneously. By applying an ℓ1 penalty during model training, LASSO forces some coefficients to zero, effectively eliminating less relevant features.
  – Random Forest importance: An ensemble-based selection strategy that evaluates feature relevance based on the contribution of each feature to the accuracy of a collection of decision trees. Features with higher cumulative importance scores are selected.
  – Permutation Importance: A model-agnostic method that estimates the importance of each feature by measuring the drop in predictive performance when the feature’s values are randomly shuffled. Only features whose permutation leads to a significant degradation in performance are retained.
All techniques are applied independently to the complete normalized vector, producing a compressed version that is used as the input to the classifier. By comparing these methods across different reduction levels, this study seeks to understand how representation compactness influences the robustness and effectiveness of spoofing detection systems.

Each dimensionality reduction strategy presents distinct advantages and limitations that may influence spoofing detection performance in different ways. Projection-based methods, such as PCA and SVD, are particularly efficient in capturing global variance and projecting the data into a dense, lower-dimensional subspace. Their main advantages include simplicity, speed, and the ability to decorrelate features; however, they are linear techniques and may not fully exploit complex nonlinear relationships present in the feature space. Selection-based methods, on the other hand, offer greater flexibility and often produce more interpretable models. ANOVA and MI rely on univariate statistical associations with the target variable, which makes them fast and model-agnostic, but they are potentially limited in capturing multivariate interactions. Wrapper methods, like RFE and Permutation Importance, are more powerful in accounting for such interactions but tend to be computationally expensive and sensitive to the choice of the underlying model. Regularization-based selection via LASSO introduces sparsity and robustness to overfitting, yet may underperform when features are highly correlated. Random Forest importance benefits from ensemble-based robustness but can introduce bias towards features with more variability or cardinality.

By exploring this diverse set of DR techniques, the goal is to identify which methods are most compatible with the fused cepstral and noncepstral representations used in this work, and how their intrinsic properties affect generalization, efficiency, and detection capability.
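Several of the listed techniques can be instantiated directly with scikit-learn; the sketch below is an illustrative configuration (hyperparameters such as the LASSO penalty and forest size are assumptions, not the paper's settings). Permutation importance is computed separately with `sklearn.inspection.permutation_importance`, since it is a scoring routine rather than a transformer, and is omitted here.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       mutual_info_classif, RFE,
                                       SelectFromModel)
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.ensemble import RandomForestClassifier

def make_reducers(k):
    """One reducer per strategy, each mapping the fused vector to k columns."""
    return {
        "pca": PCA(n_components=k),
        "svd": TruncatedSVD(n_components=k),
        "anova": SelectKBest(f_classif, k=k),
        "mi": SelectKBest(mutual_info_classif, k=k),
        "rfe": RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=k),
        # threshold=-inf keeps exactly the top-k features by importance
        "lasso": SelectFromModel(Lasso(alpha=0.01),
                                 threshold=-np.inf, max_features=k),
        "rf": SelectFromModel(RandomForestClassifier(n_estimators=50,
                                                     random_state=0),
                              threshold=-np.inf, max_features=k),
    }
```

All seven objects share the `fit_transform(X, y)` interface, so the reduction level k can be swept uniformly across strategies in the experiments.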
For readers interested in deeper technical insights into each method, we recommend consulting dedicated reviews, such as those by Jia et al. [51], Guyon and Elisseeff [52], and Van Der Maaten et al. [53].
- Classifier: For the classification stage, we adopt a Support Vector Machine (SVM) model using a Gaussian kernel, also known as the Radial Basis Function (RBF) kernel [54]. This choice reflects its wide adoption in speaker verification and spoofing detection tasks due to its ability to model nonlinear decision boundaries effectively. The classifier is trained with class probability estimation enabled, which allows posterior probability scores to be extracted for the computation of evaluation metrics, such as the EER. To improve training stability and mitigate scale sensitivity, the input features are optionally standardized prior to model fitting, using zero mean and unit variance. Additionally, a class-weight adjustment mechanism is applied to reduce sensitivity to class imbalance, ensuring robustness even in experimental configurations that do not incorporate oversampling strategies.
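The classifier configuration just described, together with an EER computation from posterior scores, can be sketched with scikit-learn as follows; this mirrors the stated choices (RBF kernel, probability estimates, optional standardization, class weighting) but the specific EER estimator is one common approximation, not necessarily the paper's exact scoring script.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve

def train_svm(X, y):
    """RBF-kernel SVM with probability outputs and class weighting."""
    clf = make_pipeline(
        StandardScaler(),                       # optional standardization
        SVC(kernel="rbf", probability=True,     # posterior scores for EER
            class_weight="balanced"))           # mitigate class imbalance
    return clf.fit(X, y)

def equal_error_rate(y_true, scores):
    """EER: operating point where false-acceptance ~= false-rejection."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[i] + fnr[i]) / 2
```

The posterior for the bona fide class serves as the detection score, and the EER is read off the ROC at the crossing of the two error rates.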
6. Results and Experiments
6.1. Dataset
6.2. Internal Analysis
6.3. State-of-the-Art Comparison
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yudin, O.; Ziubina, R.; Buchyk, S.; Bohuslavska, O.; Teliushchenko, V. Speaker’s Voice Recognition Methods in High-Level Interference Conditions. In Proceedings of the 2019 IEEE 2nd Ukraine Conference on Electrical and Computer Engineering (UKRCON), Lviv, Ukraine, 2–6 July 2019; pp. 851–854.
- Senk, C.; Dotzler, F. Biometric authentication as a service for enterprise identity management deployment: A data protection perspective. In Proceedings of the 2011 Sixth International Conference on Availability, Reliability and Security, Vienna, Austria, 22–26 August 2011; pp. 43–50.
- Memon, Q.; AlKassim, Z.; AlHassan, E.; Omer, M.; Alsiddig, M. Audio-visual biometric authentication for secured access into personal devices. In Proceedings of the 6th International Conference on Bioinformatics and Biomedical Science, Singapore, 22–24 June 2017; pp. 85–89.
- Khan, A.; Malik, K.M.; Ryan, J.; Saravanan, M. Battling voice spoofing: A review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures. Artif. Intell. Rev. 2023, 56, 513–566.
- Shaheed, K.; Szczuko, P.; Kumar, M.; Qureshi, I.; Abbas, Q.; Ullah, I. Deep learning techniques for biometric security: A systematic review of presentation attack detection systems. Eng. Appl. Artif. Intell. 2024, 129, 107569.
- Yamagishi, J.; Kinnunen, T.H.; Evans, N.; De Leon, P.; Trancoso, I. ASVspoof: The automatic speaker verification spoofing and countermeasures challenge. IEEE J. Sel. Top. Signal Process. 2017, 11, 588–604.
- Godino-Llorente, J.I.; Gómez-Vilda, P. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. IEEE Trans. Biomed. Eng. 2004, 51, 380–384.
- Contreras, R.C.; Viana, M.S.; Guido, R.C. An Experimental Analysis on Mapping Strategies for Cepstral Coefficients Multi-projection in Voice Spoofing Detection Problem. In Proceedings of the Artificial Intelligence and Soft Computing—22nd International Conference, ICAISC 2023, Zakopane, Poland, 18–22 June 2023; Proceedings, Part II; Lecture Notes in Computer Science. Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14126, pp. 291–306.
- Bellman, R. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, USA, 1961.
- Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When Is “Nearest Neighbor” Meaningful? In Proceedings of the 7th International Conference on Database Theory, Jerusalem, Israel, 10–12 January 1999.
- Gorban, A.N.; Makarov, V.A.; Tyukin, I.Y. High-dimensional brain in a high-dimensional world: Blessing of dimensionality. Entropy 2020, 22, 82.
- Contreras, R.C.; Campanharo, A.F.; Viana, M.S.; Bongarti, M.A.d.S.; Guido, R.C. Dimensionality Reduction in Multicepstral Features for Voice Spoofing Detection: Case Studies with Singular Value Decomposition, Genetic Algorithm, and Auto-Encoder. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland, 22–26 June 2025; pp. 227–244.
- Wall, M.E.; Rechtsteiner, A.; Rocha, L.M. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis; Springer: Berlin/Heidelberg, Germany, 2003; pp. 91–109.
- Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153.
- Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection. In Proceedings of the Interspeech 2017; The International Speech Communication Association (ISCA): Stockholm, Sweden, 2017.
- Yamagishi, J.; Todisco, M.; Sahidullah, M.; Delgado, H.; Wang, X.; Evans, N.; Kinnunen, T.; Lee, K.A.; Vestman, V.; Nautsch, A. ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge Database [sound]; University of Edinburgh, The Centre for Speech Technology Research (CSTR): Edinburgh, UK, 2019.
- Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537.
- Patel, T.B.; Patil, H.A. Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A Comparison of Features for Synthetic Speech Detection. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Todisco, M.; Delgado, H.; Evans, N. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Comput. Speech Lang. 2017, 45, 516–535.
- Yang, J.; Das, R.K. Low frequency frame-wise normalization over constant-Q transform for playback speech detection. Digit. Signal Process. 2019, 89, 30–39.
- Hanilçi, C. Data selection for i-vector based automatic speaker verification anti-spoofing. Digit. Signal Process. 2018, 72, 171–180.
- Qian, Y.; Chen, N.; Yu, K. Deep features for automatic spoofing detection. Speech Commun. 2016, 85, 43–52.
- Huang, L.; Pun, C.M. Audio Replay Spoof Attack Detection Using Segment-based Hybrid Feature and DenseNet-LSTM Network. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2567–2571.
- Gomez-Alanis, A.; Peinado, A.M.; Gonzalez, J.A.; Gomez, A.M. A Gated Recurrent Convolutional Neural Network for Robust Spoofing Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1985–1999.
- Himawan, I.; Madikeri, S.; Motlicek, P.; Cernak, M.; Sridharan, S.; Fookes, C. Voice Presentation Attack Detection Using Convolutional Neural Networks. In Handbook of Biometric Anti-Spoofing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 391–415.
- Contreras, R.C.; Viana, M.S.; Fonseca, E.S.; dos Santos, F.L.; Zanin, R.B.; Guido, R.C. An Experimental Analysis on Multicepstral Projection Representation Strategies for Dysphonia Detection. Sensors 2023, 23, 5196.
- Contreras, R.C.; Heck, G.L.; Viana, M.S.; dos Santos Bongarti, M.A.; Zamani, H.; Guido, R.C. Metaheuristic Algorithms for Enhancing Multicepstral Representation in Voice Spoofing Detection: An Experimental Approach. In Proceedings of the International Conference on Swarm Intelligence, Konstanz, Germany, 9–11 October 2024; pp. 247–262.
- Himawan, I.; Villavicencio, F.; Sridharan, S.; Fookes, C. Deep Domain Adaptation for Anti-spoofing in Speaker Verification Systems. Comput. Speech Lang. 2019, 58, 377–402.
- Guido, R.C. A Tutorial on Signal Energy and its Applications. Neurocomputing 2016, 179, 264–282.
- Guido, R.C. A Tutorial-review on Entropy-based Handcrafted Feature Extraction for Information Fusion. Inf. Fusion 2018, 41, 161–175.
- Guido, R.C. ZCR-aided Neurocomputing: A study with applications. Knowl.-Based Syst. 2016, 105, 248–269.
- Guido, R.C. Enhancing Teager Energy Operator Based on a Novel and Appealing Concept: Signal mass. J. Frankl. Inst. 2018, 356, 1341–1354.
- Prabakaran, D.; Shyamala, R. A review on performance of voice feature extraction techniques. In Proceedings of the 2019 3rd International Conference on Computing and Communications Technologies (ICCCT), Chennai, India, 21–22 February 2019; pp. 221–231.
- Alim, S.A.; Rashid, N.K.A. Some Commonly Used Speech Feature Extraction Algorithms; IntechOpen: London, UK, 2018.
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990, 87, 1738–1752.
- Ladefoged, P.; Johnson, K. A Course in Phonetics; Cengage Learning: Boston, MA, USA, 2014.
- Teixeira, J.P.; Fernandes, P.O. Jitter, shimmer and HNR classification within gender, tones and vowels in healthy voices. Procedia Technol. 2014, 16, 1228–1237.
- Yang, S.; Zheng, F.; Luo, X.; Cai, S.; Wu, Y.; Liu, K.; Wu, M.; Chen, J.; Krishnan, S. Effective dysphonia detection using feature dimension reduction and kernel density estimation for patients with Parkinson’s disease. PLoS ONE 2014, 9, e88825.
- Puts, D.A.; Apicella, C.L.; Cárdenas, R.A. Masculine voices signal men’s threat potential in forager and industrial societies. Proc. R. Soc. B Biol. Sci. 2012, 279, 601–609.
- Pisanski, K.; Rendall, D. The prioritization of voice fundamental frequency or formants in listeners’ assessments of speaker size, masculinity, and attractiveness. J. Acoust. Soc. Am. 2011, 129, 2201–2212.
- Little, M.; McSharry, P.; Hunter, E.; Spielman, J.; Ramig, L. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. IEEE Trans. Biomed. Eng. 2009, 56, 1015.
- Reby, D.; McComb, K. Anatomical constraints generate honesty: Acoustic cues to age and weight in the roars of red deer stags. Anim. Behav. 2003, 65, 519–530.
- Fitch, W.T. Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. Acoust. Soc. Am. 1997, 102, 1213–1222.
- Valero, X.; Alias, F. Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Trans. Multimed. 2012, 14, 1684–1689.
- Herrera, A.; Del Rio, F. Frequency bark cepstral coefficients extraction for speech analysis by synthesis. J. Acoust. Soc. Am. 2010, 128, 2290.
- Rao, K.S.; Reddy, V.R.; Maity, S. Language Identification Using Spectral and Prosodic Features; Springer: Berlin/Heidelberg, Germany, 2015.
- Zouhir, Y.; Ouni, K. Feature Extraction Method for Improving Speech Recognition in Noisy Environments. J. Comput. Sci. 2016, 12, 56–61.
- Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693.
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
- Van Der Maaten, L.; Postma, E.O.; Van Den Herik, H.J. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 2009, 10, 13.
- Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Malek, A.; Titeux, H.; Borzì, S.; Nielsen, C.H.; Stöter, F.R.; Bredin, H.; Moerman, K.M. SuperKogito/spafe: V0.3.2 [software]; Zenodo: Geneva, Switzerland, 2023.
- Jadoul, Y.; Thompson, B.; De Boer, B. Introducing parselmouth: A python interface to praat. J. Phon. 2018, 71, 1–15.
- Delgado, H.; Todisco, M.; Sahidullah, M.; Evans, N.; Kinnunen, T.; Lee, K.A.; Yamagishi, J. ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements. In Proceedings of the Speaker and Language Recognition Workshop, ISCA, Les Sables d’Olonne, France, 26–29 June 2018; pp. 296–303. [Google Scholar]
- Jelil, S.; Sinha, R.; Prasanna, S.M. Spectro-Temporally Compressed Source Features for Replay Attack Detection. IEEE Signal Process. Lett. 2024, 31, 721–725. [Google Scholar] [CrossRef]
- Chutia, B.J.; Bhattacharjee, U. Effectiveness of different Spectral Features in Replay Attack Detection-An Experimental Study. Grenze Int. J. Eng. Technol. (GIJET) 2024, 10, 1907–1913. [Google Scholar]
- Meriem, F.; Messaoud, B.; Bahia, Y.z. Texture analysis of edge mapped audio spectrogram for spoofing attack detection. Multimed. Tools Appl. 2024, 83, 15915–15937. [Google Scholar] [CrossRef]
- Neamtu, C.T.; Mihalache, S.; Burileanu, D. Liveness Detection–Automatic Classification of Spontaneous and Pre-recorded Speech for Biometric Applications. In Proceedings of the 2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, Romania, 25–27 October 2023; pp. 1–5. [Google Scholar]
- Li, W.; Xia, M. Discriminative Feature Extraction Based on SWM for Playback Attack Detection. In Proceedings of the 2023 IEEE World Conference on Applied Intelligence and Computing (AIC), Sonbhadra, India, 29–30 July 2023; pp. 763–767. [Google Scholar]
- Chettri, B. The clever hans effect in voice spoofing detection. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 577–584. [Google Scholar]
- Gupta, P.; Chodingala, P.K.; Patil, H.A. Replay spoof detection using energy separation based instantaneous frequency estimation from quadrature and in-phase components. Comput. Speech Lang. 2023, 77, 101423. [Google Scholar] [CrossRef]
- Xu, L.; Yang, J.; You, C.H.; Qian, X.; Huang, D. Device features based on linear transformation with parallel training data for replay speech detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1574–1586. [Google Scholar] [CrossRef]
- Gupta, P.; Chodingala, P.K.; Patil, H.A. Energy separation based instantaneous frequency estimation from quadrature and in-phase components for replay spoof detection. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 369–373. [Google Scholar]
- Woubie, A.; Bäckström, T. Voice quality features for replay attack detection. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 384–388. [Google Scholar]
- Bharath, K.; Kumar, M.R. Replay spoof detection for speaker verification system using magnitude-phase-instantaneous frequency and energy features. Multimed. Tools Appl. 2022, 81, 39343–39366. [Google Scholar] [CrossRef]
- Süslü, Ç.; Eren, E.; Demiroğlu, C. Uncertainty assessment for detection of spoofing attacks to speaker verification systems using a Bayesian approach. Speech Commun. 2022, 137, 44–51. [Google Scholar] [CrossRef]
- Patil, H.A.; Acharya, R.; Patil, A.T.; Gupta, P. Non-cepstral uncertainty vector for replay spoofed speech detection. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 374–378. [Google Scholar]
- Kamble, M.R.; Patil, H.A. Detection of replay spoof speech using Teager energy feature cues. Comput. Speech Lang. 2021, 65, 101140. [Google Scholar] [CrossRef]
- Liu, M.; Wang, L.; Lee, K.A.; Chen, X.; Dang, J. Replay-attack detection using features with adaptive spectro-temporal resolution. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtually, 6–11 June 2021; pp. 6374–6378. [Google Scholar]
- Jana, S.; Yashwanth, V.S.; Dheeraj, K.; Balaji, S.; Bharath, K.; Kumar, M.R. Replay Attack Detection for Speaker Verification Using Different Features Level Fusion System. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; pp. 1–5. [Google Scholar]
- Liu, M.; Wang, L.; Dang, J.; Lee, K.A.; Nakagawa, S. Replay attack detection using variable-frequency resolution phase and magnitude features. Comput. Speech Lang. 2021, 66, 101161. [Google Scholar] [CrossRef]
- Avila, A.R.; Alam, J.; Prado, F.O.C.; O’Shaughnessy, D.; Falk, T.H. On the use of blind channel response estimation and a residual neural network to detect physical access attacks to speaker verification systems. Comput. Speech Lang. 2021, 66, 101163. [Google Scholar] [CrossRef]
- Chettri, B.; Benetos, E.; Sturm, B.L. Dataset Artefacts in anti-spoofing systems: A case study on the ASVspoof 2017 benchmark. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 3018–3028. [Google Scholar] [CrossRef]
- Li, J.; Sun, M.; Zhang, X.; Wang, Y. Joint decision of anti-spoofing and automatic speaker verification by multi-task learning with contrastive loss. IEEE Access 2020, 8, 7907–7915. [Google Scholar] [CrossRef]
- Das, R.K.; Yang, J.; Li, H. Assessing the scope of generalized countermeasures for anti-spoofing. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtually, 4–8 May 2020; pp. 6589–6593. [Google Scholar]
- Kamble, M.R.; Tak, H.; Patil, H.A. Amplitude and frequency modulation-based features for detection of replay spoof speech. Speech Commun. 2020, 125, 114–127. [Google Scholar] [CrossRef]
- Phapatanaburi, K.; Buayai, P.; Naktong, W.; Srinonchat, J. Exploiting magnitude and phase aware deep neural network for replay attack detection. ECTI Trans. Electr. Eng. Electron. Commun. 2020, 18, 89–97. [Google Scholar] [CrossRef]
- Kamble, M.R.; Patil, H.A. Combination of amplitude and frequency modulation features for presentation attack detection. J. Signal Process. Syst. 2020, 92, 777–791. [Google Scholar] [CrossRef]
- Patil, A.T.; Patil, H.A. Significance of cmvn for replay spoof detection. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 532–537. [Google Scholar]
- Yang, J.; Das, R.K. Improving anti-spoofing with octave spectrum and short-term spectral statistics information. Appl. Acoust. 2020, 157, 107017. [Google Scholar] [CrossRef]
- Yang, J.; Xu, L.; Ren, B.; Ji, Y. Discriminative features based on modified log magnitude spectrum for playback speech detection. EURASIP J. Audio Speech Music Process. 2020, 2020, 6. [Google Scholar] [CrossRef]
- Lin, L.; Wang, R.; Yan, D.; Dong, L. A robust method for speech replay attack detection. KSII Trans. Internet Inf. Syst. (TIIS) 2020, 14, 168–182. [Google Scholar]
- Chettri, B.; Kinnunen, T.; Benetos, E. Deep generative variational autoencoding for replay spoof detection in automatic speaker verification. Comput. Speech Lang. 2020, 63, 101092. [Google Scholar] [CrossRef]
- Kamble, M.R.; Patil, H.A. Novel variable length Teager energy profiles for replay spoof detection. Energy 2020, 32, 33. [Google Scholar]
- Tapkir, P.A.; Kamble, M.R.; Patil, H.A.; Madhavi, M. Replay spoof detection using power function based features. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1019–1023. [Google Scholar]
- Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017) Database, Version 2 [Sound]; University of Edinburgh, The Centre for Speech Technology Research (CSTR): Edinburgh, UK, 2018. [Google Scholar] [CrossRef]
Authors | Technique/Approach | Dataset | Best EER | Main Contributions |
---|---|---|---|---|
Patel and Patil [19] | GMM with CFCC + IF | ASVspoof 2015 | 0.4079% (known) | Simulation of human auditory processing using cochlear filters and instantaneous frequency |
Sahidullah et al. [20] | Comparison of 19 feature sets + GMM/SVM | ASVspoof 2015 | 0.07% (known) | Comprehensive evaluation of spectral, phase, and long-term features |
Todisco et al. [21] | Constant-Q Transform | ASVspoof 2015 | 0.048% (known) | Adaptation of music signal processing technique for spoofing detection |
Yang and Das [22] | CQT + low-frequency frame normalization + SVM/ResNet | ASVspoof 2017 | 10.31% | Enhanced robustness to noise and environmental variability |
Hanilçi [23] | i-vectors + GMM | ASVspoof 2015 | 2.39% | EER reduction by combining traditional modeling with i-vectors |
Qian et al. [24] | DNN, AE, LSTM-RNN | ASVspoof 2015 | ≈0% (known) | Comparison of five deep neural architectures |
Huang and Pun [25] | DenseNet + LSTM with MFCC + CQCC | ASVspoof 2017 | 7.34% | Fusion of two deep networks with multicepstral features |
Gomez-Alanis et al. [26] | Gated Recurrent CNN | ASVspoof 2015/2017/2019 | 0% | Improved robustness to real-world noise |
Himawan et al. [30] | AlexNet (transfer learning) + SVM | ASVspoof 2017 | 11.21% | Use of pre-trained CNN to mitigate overfitting and extract embeddings |
Contreras et al. [8] | Fusion of MFCC, CQCC, BFCC, and LFCC with multiple projections | ASVspoof 2017 | 15.02% | Introduction of multicepstral-enriched vector mappings |
Contreras et al. [29] | GA, PSO, GWO for selection in multicepstral spaces | ASVspoof 2017 | 17.42% | Metaheuristic optimization in multicepstral representations |
Contreras et al. [12] | SVD, AE, GA for dimensionality reduction | ASVspoof 2017 | 13.41% | Dimensionality reduction in multicepstral spaces with EER improvement |
ID | Projection Strategies |
---|---|
Proj_1 | {PCA} |
Proj_2 | {MEAN} |
Proj_3 | {STD} |
Proj_4 | {SKEW} |
Proj_5 | {PCA, MEAN} |
Proj_6 | {MEAN, STD} |
Proj_7 | {STD, SKEW} |
Proj_8 | {MEAN, STD, SKEW} |
Proj_9 | {PCA, MEAN, STD, SKEW} |
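The projection strategies above can be illustrated with a minimal sketch. The statistics (MEAN, STD, SKEW) are computed per coefficient across frames; for PCA, the sketch summarizes the matrix by its leading principal directions — the paper's exact PCA mapping and dimensionalities may differ, so this is an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def project(cc, strategies, n_pca=2):
    """Map an (n_frames x n_coeffs) cepstral matrix to a fixed-length
    vector by concatenating one block per strategy in `strategies`."""
    parts = []
    mean = cc.mean(axis=0)
    std = cc.std(axis=0)
    for s in strategies:
        if s == "MEAN":
            parts.append(mean)
        elif s == "STD":
            parts.append(std)
        elif s == "SKEW":
            # Fisher skewness of each coefficient trajectory over time
            parts.append((((cc - mean) / (std + 1e-12)) ** 3).mean(axis=0))
        elif s == "PCA":
            # Leading right-singular vectors as a fixed-length summary of the
            # principal directions (a hypothetical choice; the paper's exact
            # PCA variant is not specified here).
            _, _, vt = np.linalg.svd(cc - mean, full_matrices=False)
            parts.append(vt[:n_pca].ravel())
        else:
            raise ValueError(f"unknown strategy: {s}")
    return np.concatenate(parts)

rng = np.random.default_rng(0)
cc = rng.normal(size=(200, 13))                   # 200 frames x 13 coefficients
v8 = project(cc, ["MEAN", "STD", "SKEW"])         # Proj_8 -> 3 * 13 = 39 dims
v9 = project(cc, ["PCA", "MEAN", "STD", "SKEW"])  # Proj_9 -> 2 * 13 + 39 = 65 dims
```

Combining strategies (Proj_5 through Proj_9) simply concatenates the corresponding blocks, so the output dimensionality grows linearly with the number of strategies selected.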
CC Extraction Technique(s) |
---|
MFCC |
CQCC |
iMFCC |
BFCC |
LFCC |
LPCC |
GFCC |
NGCC |
CQCC, MFCC |
CQCC, LFCC |
CQCC, LPCC |
CQCC, GFCC |
CQCC, NGCC |
CQCC, MFCC, LFCC |
CQCC, MFCC, BFCC |
CQCC, MFCC, GFCC |
CQCC, MFCC, NGCC |
CQCC, MFCC, BFCC, NGCC |
CQCC, MFCC, BFCC, LFCC |
CQCC, MFCC, BFCC, GFCC |
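Each multicepstral set above fuses the projected outputs of several cepstral front-ends into a single vector. A minimal sketch of that fusion, assuming each front-end is a function returning an (n_frames × n_coeffs) matrix — the `fake_cqcc`/`fake_mfcc` stand-ins below are hypothetical placeholders, not the actual CQCC/MFCC implementations:

```python
import numpy as np

def multicepstral_vector(signal, extractors, projector):
    """Build one feature vector by applying `projector` to each front-end's
    (n_frames x n_coeffs) cepstral matrix and concatenating the results
    in a fixed (alphabetical) order."""
    parts = [projector(extract(signal)) for _, extract in sorted(extractors.items())]
    return np.concatenate(parts)

# Hypothetical stand-in front-ends; real CQCC/MFCC extraction is not shown here.
def fake_cqcc(signal):
    return signal.reshape(-1, 1) * np.ones((1, 19))   # (n_frames x 19)

def fake_mfcc(signal):
    return signal.reshape(-1, 1) * np.ones((1, 13))   # (n_frames x 13)

x = np.linspace(-1.0, 1.0, 200)               # toy "signal"
fused = multicepstral_vector(
    x,
    {"CQCC": fake_cqcc, "MFCC": fake_mfcc},   # the {CQCC, MFCC} set
    projector=lambda cc: cc.mean(axis=0),     # MEAN projection
)
# fused has 19 + 13 = 32 dimensions
```

Larger sets such as {CQCC, MFCC, BFCC, GFCC} extend the dictionary with additional front-ends; the fused dimensionality is the sum of the projected sizes.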
Subset | Speakers | Sessions | Settings | Genuine | Spoofing |
---|---|---|---|---|---|
Training | 10 | 6 | 3 | 1507 | 1507 |
Development | 8 | 10 | 10 | 760 | 950 |
Evaluation | 24 | 161 | 57 | 1298 | 12,008 |
Total | 42 | 177 | 70 | 3565 | 14,465 |
Set | Normalized | Noncepstral | Accuracy (max) | EER (min)
---|---|---|---|---
Dev | No | No | 87.57 | 10.32
Dev | Yes | No | 86.58 | 10.37
Dev | No | Yes | 80.56 | 11.53
Dev | Yes | Yes | 79.42 | 12.05
Eval | No | No | 83.31 | 11.07
Eval | Yes | No | 83.31 | 11.07
Eval | No | Yes | 85.58 | 10.64
Eval | Yes | Yes | 85.58 | 10.64
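The EER reported throughout these tables is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on genuine trials. A minimal numpy sketch using a simple threshold sweep (evaluation toolkits typically interpolate the ROC more carefully):

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Sweep decision thresholds over all observed scores and return the
    error rate where FAR (spoof accepted) and FRR (genuine rejected)
    are closest; higher scores mean "more likely genuine"."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

g = np.array([0.9, 0.8, 0.6])   # genuine-trial scores
s = np.array([0.3, 0.2, 0.1])   # spoof-trial scores
equal_error_rate(g, s)          # -> 0.0 (perfectly separated)
```

When the score distributions overlap, FAR and FRR cross at a nonzero rate, and that crossing point is the EER percentage listed in the tables.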
Scenario | Technique | EER Mean | Standard Deviation | EER Min
---|---|---|---|---
Dev + Std | LASSO | 22.31 | 6.39 | 10.89
Dev + Std | PCA | 23.67 | 5.78 | 13.21
Dev + Std | SVD | 24.57 | 6.66 | 11.63
Dev + Std | RFE | 24.79 | 6.55 | 10.37
Dev + Std | PIS | 27.99 | 9.41 | 10.53
Dev + Std | RF | 28.69 | 7.46 | 10.79
Dev + Std | MI | 31.12 | 8.18 | 13.63
Dev + Std | ANOVA | 31.27 | 7.76 | 13.74
Dev + Unscaled | LASSO | 22.31 | 6.39 | 10.89
Dev + Unscaled | PCA | 22.83 | 6.76 | 10.32
Dev + Unscaled | SVD | 23.74 | 7.04 | 10.63
Dev + Unscaled | RFE | 24.80 | 6.56 | 10.37
Dev + Unscaled | PIS | 27.99 | 9.41 | 10.53
Dev + Unscaled | RF | 28.63 | 7.34 | 10.58
Dev + Unscaled | MI | 31.13 | 8.17 | 13.63
Dev + Unscaled | ANOVA | 31.27 | 7.76 | 13.74
Eval + Std | LASSO | 21.08 | 5.94 | 11.35
Eval + Std | PIS | 22.05 | 6.37 | 10.64
Eval + Std | RFE | 22.79 | 6.02 | 12.76
Eval + Std | PCA | 23.58 | 5.90 | 13.39
Eval + Std | SVD | 24.03 | 5.59 | 13.82
Eval + Std | RF | 25.86 | 4.67 | 15.90
Eval + Std | MI | 28.84 | 4.90 | 18.10
Eval + Std | ANOVA | 29.71 | 5.12 | 18.15
Eval + Unscaled | LASSO | 21.08 | 5.94 | 11.35
Eval + Unscaled | PIS | 22.05 | 6.36 | 10.64
Eval + Unscaled | PCA | 22.39 | 5.98 | 12.64
Eval + Unscaled | RFE | 22.79 | 6.02 | 12.76
Eval + Unscaled | SVD | 23.49 | 5.46 | 13.80
Eval + Unscaled | RF | 25.97 | 4.76 | 16.76
Eval + Unscaled | MI | 28.83 | 4.91 | 17.30
Eval + Unscaled | ANOVA | 29.71 | 5.12 | 18.15
Algorithm | Dev EER (%) | Eval EER (%) | Mean EER (%)
---|---|---|---
Proposed Method | 10.32 | 10.64 | 10.48 |
2D-ILRCC [59] | 11.90 | 10.87 | 11.38 |
iMFCC/SCMC [60] | 4.83 | 11.49 | 8.16 |
OMSp-HASC-Canny [61] | 3.52 | 22.86 | 13.19 |
CQCC, MFCC (PCA) [29] | - | 17.42 | 17.42 |
CQCC, MFCC, LFCC [28] | - | 15.02 | 15.02 |
LFCC + CQCC + LTAS [62] | 8.60 | 26.10 | 17.35 |
SWM-CQCC-DA [63] | - | 10.27 | 10.27 |
LCNN/smallCNN [64] | 7.37 | 10.70 | 9.03 |
CQCC + CFCC + CFCCIF + CFCCIF-ESA/QESA [65] | 1.88 | 10.99 | 6.43 |
LFDCC/CDOC [66] | 13.03 | 11.63 | 12.33 |
CQCC + QCFCCIF-ESA [67] | 2.19 | 11.00 | 6.59 |
CQCC + JS [68] | 7.60 | 10.90 | 9.25 |
EMD-HS-RFCC + CQCC + APGDF [69] | 4.89 | 10.84 | 7.86 |
ResNet/LCNN [70] | 7.85 | 8.20 | 8.02 |
CQCC + LFCC + MFCC + u-vector [71] | 6.66 | 11.99 | 9.32 |
TECC [72] | 10.80 | 11.41 | 11.10 |
CQCC + AT-IMFCC + AT-MelRP [73] | 2.08 | 10.75 | 6.41 |
(LPCC + LFCC)-GMM [74] | 5.26 | 22.65 | 13.95 |
CQCCs + AFCCsFAF + ARPDBF [75] | 1.72 | 11.22 | 6.47 |
Multicepstral features + ResNet/GMM [76] | 7.57 | 11.64 | 9.60 |
CQCC-CNN [77] | 7.37 | 10.70 | 9.03 |
TDNN + 2PLDA − CQCC + MFCC [78] | - | 11.16 | 11.16 |
CQSPIC [79] | - | 11.34 | 11.34 |
ESA-IACC-IFCC [80] | 7.03 | 10.12 | 8.57 |
GMM + CQCC- DNN + (MFCC + MGDCC) [81] | 5.86 | 24.14 | 15.00 |
CQCC (90-D (SDA)) − AWFCC [82] | 5.75 | 10.42 | 8.08 |
LFCC [83] | 7.02 | 14.83 | 10.92 |
(eCQCC-STSSI)-DA [84] | - | 10.07 | 10.07 |
CVOC-DA [85] | - | 11.46 | 11.46 |
SFCC-QCN [86] | 8.38 | 10.11 | 9.24 |
Spectrogram-CNN [87] | 10.82 | 16.03 | 13.42 |
CQCC + VTECC [88] | 5.85 | 10.94 | 8.39 |
CQNCC-D DNN [22] | - | 10.31 | 10.31 |
CQCC + PNCC [89] | 8.73 | 12.98 | 10.85 |
CQCC-CMVN [58] | 9.06 | 13.74 | 11.40 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Souza, L.M.d.; Guido, R.C.; Contreras, R.C.; Viana, M.S.; Bongarti, M.A.d.S. Improving Voice Spoofing Detection Through Extensive Analysis of Multicepstral Feature Reduction. Sensors 2025, 25, 4821. https://doi.org/10.3390/s25154821