Generalized Replay Spoofing Countermeasure Based on Combining Local Subclassification Models

Abstract: Automatic speaker verification (ASV) systems play a prominent role in the security field due to the usability of voice biometrics compared to alternative biometric authentication modalities. Nevertheless, ASV systems are susceptible to malicious voice spoofing attacks. In response to such threats, countermeasures have been devised to prevent breaches and ensure the safety of user data by categorizing utterances as either genuine or spoofed. In this paper, we propose a new voice spoofing countermeasure that seeks to improve the generalization of supervised learning models. This is accomplished by alleviating the problem of intraclass variance. Specifically, the proposed approach addresses the generalization challenge by splitting the classification problem into a set of local subproblems in order to simplify the supervised learning task. The system outperformed existing state-of-the-art approaches with an EER of 0.097% on the ASVspoof challenge corpora related to replay spoofing attacks.


Introduction
Biometric authentication is currently widely used, along with other identification modalities, to supervise and control access to applications and stored data [1]. Speaker verification (SV) systems exploit the speech modality to identify the user seeking to gain access to systems or services. Specifically, voiceprint authentication is performed by comparing the voice of the speaker to previously recorded voiceprints. The growing popularity of voice-activated smart home systems has increased the prominence of automatic speaker verification (ASV) technology as a security measure for such devices. These ASV systems also benefit other services such as phone banking and online payment processing. Nevertheless, serious security concerns constrain the potential of these systems. Indeed, spoofing attacks pose a threat to ASV systems [2]. Both the International Electrotechnical Commission (IEC) and the International Organization for Standardization (ISO) have defined such intrusions as presentation attacks (PAs) [3]. These attacks are conducted by criminals impersonating an authenticated user in an attempt to gain access to private information [1], and are performed through speech synthesis (SS), replay attacks, and voice conversion (VC) techniques [1,2]. Among these techniques, replay attacks are the most common since they do not require substantial technological knowledge. Furthermore, it is difficult to detect such attacks due to the simplicity of the technique, which consists of collecting voice samples and then replaying them. To block these spoofing attacks, it is necessary to devise antispoofing countermeasures. Such a countermeasure consists of a classification system that distinguishes between genuine and spoofed utterances.
Classifying voice utterances as genuine or spoofed typically entails both performing suitable feature extraction and applying a classification technique to these features. In this

Background
Both unsupervised and supervised learning paradigms are involved in the design of the proposed approach. Indeed, unsupervised learning, specifically clustering, is utilized to learn the underlying structure of the data and split it automatically into homogeneous subgroups. Alternatively, supervised learning, particularly classification, is exploited to devise a set of countermeasures suitable for each subgroup.

Clustering
Clustering uses a specific similarity measure to group similar utterances into the same cluster and dissimilar ones into different clusters. There are three main families of techniques. The first is hierarchical clustering, which establishes a hierarchical structure of the clusters by adopting either a top-down approach (known as divisive) or a bottom-up approach (known as agglomerative). The second is partitioning clustering (also known as centroid-based clustering). This method learns a representative instance for each cluster (e.g., the cluster center) and assigns each instance to the closest representative. The third is density-based clustering [12], which assigns instances to clusters on the basis of density. More specifically, clusters are formed of dense instances, while sparse instances are categorized as outliers.
These clustering techniques can be either crisp or fuzzy. The former assigns each instance to exactly one cluster, whereas the latter uses membership degrees to assign each instance to multiple clusters on the basis of its probability of belonging to each cluster. This allows fuzzy clustering to be applied to real-world problems with overlapping cluster boundaries [6]. Three fuzzy clustering algorithms are outlined in the following subsections: fuzzy C-means (FCM) clustering [14], simultaneous clustering and attribute discrimination (SCAD) [15], and competitive agglomeration (CA) [13].

Fuzzy C-Means
Fuzzy C-means (FCM) [14] computes a fuzzy partition of unlabeled data by minimizing intracluster distances. Specifically, if $x_j$, $j = 1, \dots, N$, denotes the set of instances, the cluster representatives (centers), $c_i$, and the fuzzy memberships, $\mu_{ij}$, are derived by minimizing the objective function defined in (1) subject to (2). Both $c_i$ and $\mu_{ij}$ are learned alternately through iterative optimization.
$$J = \sum_{i=1}^{C} \sum_{j=1}^{N} \mu_{ij}^{m} \, \lVert x_j - c_i \rVert^{2} \tag{1}$$
subject to
$$\mu_{ij} \in [0, 1] \;\; \forall i, j; \quad \text{and} \quad \sum_{i=1}^{C} \mu_{ij} = 1 \;\; \forall j \tag{2}$$
In (1) and (2), $d$ denotes the dimension of the vectors, $C$ denotes the number of clusters, $m$ is a parameter controlling the membership fuzziness, $N$ is the number of utterances, and $x_j, c_i \in \mathbb{R}^{d}$. FCM is fast and robust, with a time complexity of O(N).
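The alternating optimization of FCM can be sketched as follows. This is a minimal, standard-library illustration of the textbook FCM updates (centers as membership-weighted means, memberships from inverse squared distances), not the implementation used in the experiments; the function name `fcm`, its default parameters, and the toy data in the usage note are assumptions made for illustration.

```python
import random


def fcm(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-means on a list of equal-length tuples (requires m > 1).

    Returns (centers, u), where u[i][j] is the membership of point j
    in cluster i. Illustrative sketch only.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so that each point's
    # memberships sum to 1 across clusters.
    u = [[rng.random() for _ in range(n)] for _ in range(n_clusters)]
    for j in range(n):
        total = sum(u[i][j] for i in range(n_clusters))
        for i in range(n_clusters):
            u[i][j] /= total
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by u_ij ** m.
        centers = []
        for i in range(n_clusters):
            w = [u[i][j] ** m for j in range(n)]
            sw = sum(w)
            centers.append(tuple(
                sum(w[j] * points[j][k] for j in range(n)) / sw
                for k in range(dim)))
        # Membership update from the squared distances to the centers.
        shift = 0.0
        for j in range(n):
            d2 = [max(sum((points[j][k] - c[k]) ** 2 for k in range(dim)), 1e-12)
                  for c in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            s = sum(inv)
            for i in range(n_clusters):
                new = inv[i] / s
                shift = max(shift, abs(new - u[i][j]))
                u[i][j] = new
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

For example, calling `fcm(points, 2)` on two well-separated groups of two-dimensional points returns memberships that sum to 1 for every point, and taking the largest membership of each point recovers the two groups.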

Simultaneous Clustering and Attribute Discrimination
Simultaneous clustering and attribute discrimination (SCAD) [15] is an extension of FCM that performs feature selection and aggregation while clustering. For each cluster, this process learns relevant feature weights, $v_{ik}$, and fuzzy memberships, $u_{ij}$, by minimizing the objective function in (3), subject to $v_{ik} \in [0, 1]$ $\forall i, k$ and $\sum_{k=1}^{d} v_{ik} = 1$ $\forall i$, where $N$ denotes the number of utterances, $C$ denotes the number of clusters, and $d$ denotes the feature dimension, so that the weights $v_{ik}$ and the center entries $c_{ik}$ are indexed by $k = 1, \dots, d$. Since SCAD is based on fuzzy C-means, it is fast and robust. It also has the same time complexity, O(N).

Competitive Agglomeration
Another extension of FCM is competitive agglomeration (CA) [13]. CA handles the challenge of determining the number of clusters. This technique fuses hierarchical and partitioning processes to exploit the advantages of both in order to learn the number of clusters, the cluster representatives, and the fuzzy memberships. CA learns the optimal number of clusters by first splitting the utterances into many small clusters that subsequently compete for instances during the optimization process. Consequently, clusters that lose the competition gradually become empty and vanish. CA is achieved through the optimization of the objective function in (6):
$$J = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{2} \, d^{2}(x_j, \beta_i) \; - \; \alpha \sum_{i=1}^{C} \Big[ \sum_{j=1}^{N} u_{ij} \Big]^{2}, \quad \text{subject to} \quad \sum_{i=1}^{C} u_{ij} = 1 \;\; \forall j \tag{6}$$
where the cluster representatives are $B = (\beta_1, \dots, \beta_C)$, $d^{2}(x_j, \beta_i)$ is the distance between the feature vector $x_j$ and the prototype $\beta_i$, $u_{ij}$ represents the degree of membership of utterance $j$ in cluster $i$, and $\alpha$ weights the second term relative to the first.
The cost defined in (6) contains two parts. The left term corresponds to the FCM objective defined in (1) and is responsible for the fuzzy partitioning, while the right term expresses the competition between clusters for instances. Since CA is also based on fuzzy C-means, it has a time complexity of O(N).

Classification
Classification is a supervised learning technique where a model is built using labeled data instances in order to predict the class value for unseen instances [16]. Specifically, the model learns how to map input instances to the predefined classes. Thus, for the learned model to be effective, the training data should be representative and sufficiently abundant. A classification problem can be binary or multiclass. For binary classification, only two classes are considered, while for multiclass classification, more than two classes are considered. Classification problems can also be categorized as linear or nonlinear. Linear classifiers employ linear models for class prediction. Alternatively, nonlinear classifiers learn nonlinear models [6]. In the literature, various classification algorithms have been proposed. However, there is no way to know in advance which classification model is most suitable for a given problem. As such, the choice of the classifier is generally made empirically. In the following, we outline the classification approaches that are exploited in the design of the proposed approach: the Gaussian mixture model (GMM) classifier [6], support vector machine (SVM) [17], and extreme gradient boosting (XGBoost) [18].

Gaussian Mixture Model
The Gaussian mixture model (GMM) classifier [6] learns a probabilistic model that represents the distribution of the instances of each class as a weighted mixture of Gaussians. More specifically, on the basis of the probability density functions of an input instance with respect to each class, this classifier predicts the class of each instance using Bayes' rule [19]. The mean, standard deviation, and weight of each Gaussian involved in the mixture are estimated using the expectation maximization (EM) [20] iterative approach or the maximum a posteriori (MAP) approach [21].
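The prediction step of a GMM classifier can be sketched as follows, assuming the mixture parameters have already been estimated. The sketch evaluates each class-conditional mixture density and applies Bayes' rule; the one-dimensional parameters in `models` and `priors` are hypothetical values chosen for illustration, and the EM or MAP estimation step is not shown.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, std)
    whose weights sum to 1."""
    return sum(w * gaussian_pdf(x, mu, sd) for w, mu, sd in components)


def posterior(x, class_models, priors):
    """Bayes' rule: P(class | x) is proportional to p(x | class) * P(class)."""
    joint = {c: gmm_pdf(x, comps) * priors[c]
             for c, comps in class_models.items()}
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}


# Hypothetical two-component models, for illustration only.
models = {
    "genuine": [(0.5, -1.0, 0.5), (0.5, 0.0, 0.5)],
    "spoofed": [(0.5, 3.0, 0.5), (0.5, 4.0, 0.5)],
}
priors = {"genuine": 0.5, "spoofed": 0.5}
```

The predicted class is the one with the largest posterior, for example `max(posterior(x, models, priors).items(), key=lambda kv: kv[1])[0]`.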

Support Vector Machine
Support vector machine (SVM) [17] is a binary classifier that learns a hyperplane separating the two considered classes. The hyperplane is learned in a way that ensures a maximal margin of separation between the two classes. The version of SVM that does not allow any instance to reside within the margin is called hard SVM [22]. Alternatively, in order to learn a less complex hyperplane and avoid overfitting, soft SVM [23] tolerates a few classification errors by letting certain instances from both classes reside within the margin. Although SVM is designed to be a binary classifier, it can be employed for multiclass problems. This application involves learning a hyperplane with respect to each class. The problem then amounts to classifying each category against all other classes, which is known as one-versus-all SVM [24]. When the data are not linearly separable, kernel SVM [25] is more suitable. This method applies the kernel trick to express the data in a new space of higher dimension, allowing for a better separation of the categories.
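The soft-margin idea can be illustrated with a small linear SVM trained by batch subgradient descent on the regularized hinge loss. This is a sketch under stated assumptions: the function names, the regularization weight `lam`, the learning rate, and the number of epochs are illustrative choices, and practical systems would normally rely on a dedicated solver.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the soft-margin objective
    lam / 2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b))),
    with labels y in {-1, +1}. Illustrative sketch only."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1.0:  # inside the margin or misclassified
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0.0 else -1
```

Only instances with a margin below 1 contribute to the hinge subgradient, which is what allows a few instances to remain inside the margin while the regularization term keeps the hyperplane simple.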

Extreme Gradient Boosting
A decision tree (DT) is a classification model that consists of nested "if/else" conditions. A gradient boosting decision tree (GBDT) combines a set of DT models to achieve a better overall model. This process involves building a series of DT models iteratively, where each newly generated model accounts for and addresses the errors of the previous models. As a result, the output consists of a weighted sum of the outputs of all considered DT models. Extreme gradient boosting (XGBoost) [18] is also a GBDT. Nevertheless, it performs parallel tree boosting, parallelizing the construction of each tree rather than building it purely sequentially like a conventional GBDT, and it uses the gradient values to assess each candidate split of the training set.
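The iterative residual-fitting principle behind gradient boosting can be sketched with one-dimensional decision stumps and a squared loss. This is a deliberately simplified illustration, not XGBoost itself, which additionally uses second-order gradient information, regularization, and parallelized split finding; the function names, the learning rate, and the number of rounds are assumptions.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error.
    Requires at least two distinct values in xs."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Fit stumps sequentially, each one to the current residuals
    (the negative gradient of the squared loss)."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, lr


def predict(model, x):
    """Weighted sum of the base value and all stump outputs."""
    base, stumps, lr = model
    return base + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)
```

Each round corrects what the previous rounds left unexplained, so the final prediction is a weighted sum of all stump outputs.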

Related Works
Audio classification seeks to learn a model that is able to predict the category of an unknown audio utterance [26]. This machine learning task can benefit many practical fields, such as medical applications related to diagnosing sleep bruxism [27], dementia [28], and depression [29]. Moreover, industrial applications have exploited audio classification techniques for several scenarios, such as detecting machine chatter [30] and monitoring the condition of rotating machines [31]. Furthermore, environmental sound recognition [32,33] has contributed to the understanding of the context in which audio occurs. In fact, it is crucial for triggering decisive actions such as evacuating a building when an alarm sounds or attending to a baby when it cries.
Typically, state-of-the-art approaches consist of classifying a voice utterance as genuine or spoofed. These approaches are based on conventional classification paradigms, deep learning paradigms, or a combination of supervised and unsupervised learning paradigms.

Conventional Approaches
Conventional approaches consist of two main components. The first extracts an audio feature suitable for discriminating genuine from spoofed utterances, while the second trains a model able to categorize the extracted features. In particular, the work in [4] extracted the constant Q Cepstral coefficient (CQCC) feature [4] and employed the Gaussian mixture model (GMM) [6] as a classifier. This system is considered to be a baseline approach to assess antispoofing systems [1,2]. Similarly, the countermeasure proposed in [37] extracted a combination of cochlear filter Cepstral coefficients (CFCCs) [38] and the instantaneous frequency (IF) [39], and fed them into a GMM classifier. Alternatively, the authors in [5] proposed a countermeasure based on linear frequency Cepstral coefficient (LFCC) [40] features and a GMM classifier after comparing 19 different features coupled with SVM [17] and GMM [6,41] classifiers. On the other hand, the study in [42] combined mel-frequency Cepstral coefficient (MFCC) [43], mel-frequency principal coefficient (MFPC) [44], and CosPhase principal coefficient (CosPhasePC) [45] features and fed them to an SVM classifier.

Deep Learning Approaches
Due to the boost achieved by deep neural networks (DNNs) in the machine learning field, particularly in classification tasks, antispoofing approaches based on the deep learning paradigm have been proposed. For this purpose, several deep learning models have been exploited. More specifically, the system outlined in [46] utilizes a dilated residual network (DRN) deep learning model [47] including a ResNet [47] model with an attention filtering mechanism to discard irrelevant audio segments such as background noise. The ResNet [47] deep learning model was also utilized in the system described in [48]. Here, two low-level cepstral features, MFCCs [43] and CQCCs [4], were fed into the network instead of the raw data. A similar model was deployed in the system described in [49]. However, rather than using MFCCs [43] as input, this model uses high-frequency Cepstral coefficients (HFCCs). Similarly, the authors in [50] employed ResNet along with SENet [51], Mean-Std ResNet [51], and Dilated ResNet [52] to analyze CQCCs and spectrogram features. The fusion of these models is performed using the greedy fusion scheme presented in [53]. As a result, the fusion of these deep learning models was found in [50] to yield a system that outperformed reported state-of-the-art approaches on the ASVspoof 2019 replay benchmark.
Recurrent neural networks (RNNs) [9] have also been exploited to design countermeasures. As such, the research in [54,55] employed long short-term memory (LSTM) [56]. The research in [57,58] exploited RNN [9] along with a convolutional neural network (CNN) [59]. In these works, the CNN functioned as a feature extractor, while the RNN performed long dependency processing. Similarly, the study in [60] exploited a combination of CNN and RNN. Specifically, the study combined three systems, namely, the i-vector [61] system, the light convolutional neural network (LCNN) [62] system, and the CNN + RNN one. LCNN was also employed along with a small Bayesian neural network [63] in [63,64]. In [65], the softmax function was replaced with the softplus function to estimate the prediction uncertainty of the deep learning model. Alternatively, a light convolutional gated recurrent neural network (LC-GRNN) was used in [66], and the authors in [67] adopted a variant of LCNN based on context gate CNN (CGCNN), which used gated linear unit (GLU) activations as a context gate for each filter. The adopted feature for this system was Log-CQT.

Combination of Unsupervised and Supervised Learning
Recently, a spoofing countermeasure based on mining hidden partitions of genuine and spoofed utterances using fuzzy clustering was proposed in [72]. This countermeasure partitions each class (genuine/spoofing) into subgroups such that each subgroup shares the same characteristics and thus exhibits low variance. The classification of unknown utterances is then performed by assigning them to the closest subgroup. Figure 1 depicts the spoofing countermeasure reported in [72]. First, audio features are extracted from all utterances. Then, the instances of each category (genuine/spoofing) are clustered using a fuzzy clustering approach. As such, the representatives of the genuine subcategories and those of the spoofing subcategories are learned. The experimental results showed that two genuine clusters and two spoofing clusters dramatically increased the performance, yielding a testing EER of 1.07% on the ASVspoof 2017 replay benchmark dataset.
An illustrative example of the countermeasure reported in [72] is depicted in Figure 2. In this example, six clusters $\{S_1, S_2, S_3, S_4, S_5, S_6\}$ are learned for the spoofing class, and four clusters $\{G_1, G_2, G_3, G_4\}$ are learned for the genuine class. Then, an unknown utterance is compared to the 10 cluster centers. Since the closest cluster is $G_1$, one of the genuine clusters, the unknown utterance is classified as genuine.
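The decision rule of the countermeasure in [72], as described above, amounts to a nearest-center lookup over the genuine and spoofing cluster centers. The following sketch assumes the centers have already been learned by fuzzy clustering; the function name and the two-dimensional centers in the usage note are hypothetical and serve only as an illustration.

```python
import math


def classify_by_nearest_center(x, genuine_centers, spoof_centers):
    """Label x with the class of the closest cluster center
    (Euclidean distance). Illustrative sketch only."""
    labelled = ([(c, "genuine") for c in genuine_centers]
                + [(c, "spoofed") for c in spoof_centers])
    _, label = min(labelled, key=lambda cl: math.dist(x, cl[0]))
    return label
```

With hypothetical genuine centers `[(0.0, 0.0), (1.0, 4.0)]` and spoofing centers `[(5.0, 5.0), (6.0, 1.0)]`, the utterance `(0.5, 0.5)` is closest to the first genuine center and is therefore labelled as genuine.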
Tables 1 and 2 report the performance of the state-of-the-art approaches for the 2017 and 2019 ASVspoof replay benchmark datasets, respectively. The study in [72] achieved the best performance, with a testing error rate of 1.07% on the ASVspoof 2017 replay benchmark dataset. For the ASVspoof 2019 replay benchmark dataset, the countermeasure proposed in [50] obtained the best performance, with a testing error rate of 0.59%.

Discussion
Neither conventional nor deep learning approaches have managed to overcome the challenge posed by the high variation of utterances. Indeed, these models suffer from generalization issues. In other words, while these countermeasures perform well on the utterances used for training, they are unable to generalize to unseen utterances. Alternatively, the countermeasure proposed in [72] addressed the generalization problem by mining hidden partitions of the genuine and spoofed utterances separately. Nevertheless, while it takes into account the intraclass variance by learning the underlying structure of each class, this solution did not consider overlaps between the genuine and spoofed categories. Indeed, this method did not learn the overall underlying structure of the data.

Proposed Approach: Generalized Replay Spoofing Countermeasure Based on Combining Local Sub-Classification Models
We propose an alternative approach that mines the hidden structure of the whole data. More specifically, the proposed countermeasure splits the classification problem into local subproblems. In other words, in order to avoid learning a complex classification model for the whole data, we intend to split the data into groups formed of congregated instances and build a simpler classification model from each group. These groups are heterogeneous and include spoofing and genuine utterances assigned to the same group due to their similarities. By classifying the utterances of each cluster into spoofing and genuine, a classification model is learned with respect to each cluster. This results in a set of local classification models. Using these models, the classification of an unknown instance is then achieved through an ensemble learning approach that combines the obtained local models.
The proposed spoofing countermeasure is depicted in Figure 3. First, audio features are extracted from the recorded utterances. Then, the three clustering techniques of FCM [14], SCAD [15], and CA [13] are investigated to partition the data. FCM-based clustering approaches are explored because they learn the cluster centers while also learning a fuzzy partition of the data. Alternatively, SCAD has the advantage of learning relevant feature weights and their combinations while clustering the data, whereas CA learns the number of homogeneous partitions automatically. From each cluster containing both spoofed and genuine instances, a classification model is learned. We propose to employ GMM [6] and SVM [17] as classification techniques, since these models were effective in the prediction of spoofed utterances [4,5,37,42,48,49,60]. Lastly, an ensemble learning technique is adopted to classify unknown instances by combining the decisions of the learned models. More specifically, the pairwise distances between the unknown utterance and the cluster representatives are computed. The classification model corresponding to the closest sub-group is then used for classifying this utterance.
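The routing step described above can be sketched as follows: training instances are grouped by their closest cluster representative regardless of their labels, one local model is trained per group, and an unknown utterance is classified by the model of its closest cluster. In this sketch the cluster centers are assumed to be given (e.g., by a fuzzy clustering step), and a nearest-class-mean rule is used as a simple stand-in for the GMM and SVM local classifiers; all function names and the toy data in the usage note are hypothetical.

```python
import math


def nearest(x, centers):
    """Index of the cluster representative closest to x."""
    return min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))


def class_means(samples):
    """Stand-in local model: the mean feature vector of each label."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def train_local_models(features, labels, centers):
    """Group instances by closest center (labels are ignored for the
    grouping), then train one local model per group."""
    groups = [[] for _ in centers]
    for x, y in zip(features, labels):
        groups[nearest(x, centers)].append((x, y))
    # Note: an empty group would yield an empty model; not handled here.
    return [class_means(g) for g in groups]


def predict(x, centers, local_models):
    """Classify x with the local model of its closest cluster."""
    model = local_models[nearest(x, centers)]
    return min(model, key=lambda y: math.dist(x, model[y]))
```

As a toy illustration, take one cluster around `(0, 0)` in which genuine instances lie below spoofed ones and another around `(10, 0)` in which the order is reversed. A single global class-mean rule cannot separate the two classes, since both class means coincide, whereas the two local models can.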
To better illustrate the proposed spoofing countermeasure based on local classification subproblems, an example is presented in Figure 4. In this example, audio instances are clustered into six groups: $\{R_1, R_2, R_3, R_4, R_5, R_6\}$. Although the training set was labeled into spoofing and genuine instances, these labels were not used for the clustering task. In fact, the whole data were clustered without consideration of the ground truth. Therefore, each obtained cluster included both spoofing and genuine instances. Then, during the training phase, a classification model was learned from each cluster. This resulted in a set of six classification models, which were used to classify the unknown utterance through ensemble learning techniques. For example, since the unknown utterance is closest to cluster $R_1$, the model learned for $R_1$ will be employed for its classification. Moreover, for each subgroup, a set of classifiers was investigated. This set contained the Gaussian mixture model (GMM) classifier [6], support vector machine (SVM) [17], and XGBoost [18]. To minimize learning errors and enhance the overall learning performance of each local subproblem, ensemble learning [75] was exploited to combine the results of the considered classifiers. For this purpose, the majority voting strategy was employed [75].
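The majority strategy used to combine the classifiers of one local subproblem can be sketched in a few lines. The function name is an assumption; with three classifiers and two classes, ties cannot occur.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers."""
    return Counter(predictions).most_common(1)[0][0]
```

For instance, if the GMM and XGBoost models of a cluster predict "spoofed" while the SVM predicts "genuine", the combined decision is "spoofed".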

Experiments
To assess the performance of the proposed approach, two replay datasets were considered: the ASVspoof 2017 version 2.0 benchmark dataset [35] and the ASVspoof 2019 benchmark dataset [36]. The audio files included in these datasets are characterized by a 16 kHz sampling rate and 16-bit resolution. As reported in Table 3, ASVspoof 2017 v2.0 was split into three subsets. The first subset was a training set that contained 3014 files, of which 1507 were genuine and 1507 were replay spoofing files. The second subset was a development set containing 1710 files, of which 760 were genuine and 950 were replay spoofing files. The third subset was an evaluation set containing 13,306 files, of which 1298 were genuine and 12,008 were replay spoofing files. As shown in Table 4, the ASVspoof 2019 replay spoofing dataset comprised a training set with 48,600 spoofed utterances and 5400 genuine utterances, a development set with 24,300 spoofed utterances and 5400 genuine utterances, and an evaluation set containing various randomly chosen acoustic and playback configurations [36]. From the audio files, three audio features were extracted: mel-frequency Cepstral coefficients (MFCCs) [43], constant Q Cepstral coefficients (CQCCs) [4], and linear frequency Cepstral coefficients (LFCCs) [5]. The equal error rate (EER) [76] was used as the performance measure. The EER represents the operating point at which the false acceptance rate (FAR) and false rejection rate (FRR) are equal [76].
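The EER can be approximated from classifier scores by sweeping a decision threshold and locating the point where the FAR and FRR are closest. The sketch below assumes that higher scores indicate genuine utterances and that an utterance is accepted when its score is at least the threshold; the function name and these conventions are assumptions, and official ASVspoof evaluations rely on their own scoring tools.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by a threshold sweep over the observed scores.

    Returns (eer, threshold). Higher scores mean 'more likely genuine';
    a score is accepted when it is >= threshold. Illustrative sketch only.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # FAR: fraction of spoofed utterances wrongly accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # FRR: fraction of genuine utterances wrongly rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```

When the FAR and FRR curves do not cross exactly at an observed score, the sketch reports the average of the two rates at the closest threshold.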

Experiment 1: Number of Clusters and Audio Feature Investigation
In this experiment, the FCM [14] clustering approach was employed to mine the hidden structure of the data. This approach partitions the whole ASVspoof 2017 benchmark dataset into homogeneous local subgroups. Each subgroup contained both genuine and replay spoofed utterances and thus constituted a local classification subproblem. Two classifiers, SVM [17] with a linear kernel and GMM [6] with two mixture components, were utilized to solve these subproblems. Furthermore, to explore the structure of the data, different numbers of clusters were considered. These numbers were tuned between 2 and 15. Moreover, the data were clustered using CQCC, MFCC, and LFCC features. Each feature was considered both independently and in concatenation with the others. Figures 5 and 6 depict the EER obtained with MFCC, CQCC, LFCC, and their concatenation with respect to the cluster number when considering SVM and GMM classifiers, respectively. The results indicate that performance varied with respect to the number of clusters, the type of audio features, and the classifier.
For SVM-based systems, CQCC features generally performed better than the other considered features, especially when the cluster number was less than 9, and the best performance was achieved with two clusters. Alternatively, for GMM-based approaches, the best performance was attained with four clusters. Nonetheless, CQCC remained the best performing feature type. Table 5 reports the best performance achieved by each combination of feature/classifier for the optimal number of clusters. The system that used CQCC features and SVM with two clusters had the smallest EER (1.61%) and thus outperformed the other combinations. The second best was the system employing CQCC features and GMM, with an EER of 4.23%.

Experiment 2: Self-Learning the Number of Clusters
In this experiment, the hidden partition was discovered automatically using the competitive agglomeration (CA) [13] clustering approach, which simultaneously partitions the training utterances and estimates the cluster number. The cluster number was initially set to 100. Both the ASVspoof 2017 and the ASVspoof 2019 benchmark datasets were considered. Table 6 reports the EER obtained when employing SVM as a classifier, along with the cluster number learned for each feature. As shown in Table 6, CQCC achieved the lowest EERs, 1.42% and 1.63% on the ASVspoof 2017 and ASVspoof 2019 datasets, respectively, with a learned number of clusters equal to two. Thus, starting from an initial 100 clusters, CA achieved results similar to those obtained in the first experiment by tuning the number of clusters. Table 7 reports the EERs obtained when using the GMM classifier, along with the cluster number learned for each feature. The results confirm the superiority of the CQCC features, which achieved EERs of 1.38% and 1.46% on the ASVspoof 2017 and ASVspoof 2019 datasets, respectively, with a learned number of clusters equal to four. This is consistent with the results obtained by exploring the cluster number in Experiment 1 and suggests that CA can discover the hidden partitions of the data while self-learning the optimal number of clusters.

Experiment 3: Feature Relevance Weight Learning
In this experiment, simultaneous clustering and attribute discrimination (SCAD) [15] was used to mine the hidden structure of the data and to learn the relevance weights of the CQCC, MFCC, and LFCC features. First, the cluster number was set to 2 for SVM-based systems and 4 for GMM-based systems, in accordance with the results obtained in Experiments 1 and 2. Table 8 reports the learned feature relevance weights. As shown in Table 8, the largest weight was assigned to CQCC. This result is consistent with the findings of Experiment 1, which indicated that CQCC is the most suitable feature. Next, we discarded MFCC and LFCC and applied SCAD to the entries of CQCC to learn the relevance of each entry. The considered cluster numbers were between 2 and 16. Figure 7 shows the EER achieved for each cluster number when employing SCAD and SVM [17] on the ASVspoof 2017 version 2.0 benchmark dataset. The lowest EER, equal to 0.154%, was achieved with two clusters. When using the GMM classifier, the lowest EER was equal to 0.302% with four clusters, as shown in Figure 8. This suggests that employing SCAD on CQCC gave better performance because it handles the high dimensionality of the CQCC feature by computing a weighted sum over the feature entries.
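The role of the learned relevance weights can be illustrated with a minimal sketch. It assumes the commonly used SCAD-style distance, in which each cluster carries its own weight per feature entry and the distance to a cluster center is the weighted sum of squared per-entry differences. The function names, centers, and weight values below are hypothetical and chosen only for illustration; they are not the weights reported in Table 8.

```python
def weighted_distance(x, center, weights):
    """Weighted sum of squared per-entry differences between x and a center."""
    return sum(w * (xi - ci) ** 2 for xi, ci, w in zip(x, center, weights))


def nearest_cluster(x, centers, weights):
    """Index of the cluster whose center is closest under its own weights."""
    distances = [weighted_distance(x, c, v) for c, v in zip(centers, weights)]
    return min(range(len(centers)), key=distances.__getitem__)


# Two hypothetical cluster centers over three feature entries.
centers = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
x = [0.1, 0.9, 0.9]

# Uniform weights: every entry counts equally.
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
# Learned-style weights (illustrative): the first entry is deemed most relevant.
relevant_first = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]

print(nearest_cluster(x, centers, uniform))
print(nearest_cluster(x, centers, relevant_first))
```

With uniform weights the two noisy entries dominate and the utterance is placed in the second cluster, whereas emphasizing the first entry moves it to the first cluster. This is the sense in which weighting can reduce the influence of less relevant entries in a high-dimensional feature.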

Ensemble Learning
On the basis of the findings of the previous experiments, we only considered CQCC features in this experiment. We first applied the CA [13] algorithm to estimate the optimal cluster number. The learned fuzzy memberships were then used as the initial values for the SCAD [15] clustering algorithm. Three classifiers were first considered separately: SVM [17], GMM [6], and XGBoost [18]. Next, the outputs of these classifiers were combined using the majority vote ensemble learning strategy. Table 9 reports the EER achieved by the considered systems. As shown in Table 9, the proposed approach based on SVM [17] outperformed those based on XGBoost [18] and GMM [6], with an EER equal to 0.154%. The ensemble majority voting strategy further improved performance, achieving an EER equal to 0.097%.
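The majority vote step can be sketched as follows. This is a minimal illustration, assuming each classifier outputs one hard label per utterance; the function name and the label lists are hypothetical and do not reproduce the outputs of the systems in Table 9.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists into one label list by majority vote."""
    n_utterances = len(predictions[0])
    if any(len(p) != n_utterances for p in predictions):
        raise ValueError("all classifiers must label the same utterances")
    # For each utterance, keep the label chosen by most classifiers.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Hypothetical hard decisions of three classifiers on four utterances.
svm_out = ["genuine", "spoof", "spoof", "genuine"]
gmm_out = ["genuine", "genuine", "spoof", "spoof"]
xgb_out = ["spoof", "spoof", "spoof", "genuine"]

fused = majority_vote([svm_out, gmm_out, xgb_out])
print(fused)
```

With three classifiers and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.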

Experiment 4: Performance Comparison with Related Spoofing Detection Approaches
According to the findings of the previous experiments, the SCAD clustering algorithm with CQCC achieved the best performance for all three considered classifiers. As such, in this experiment, we considered four versions of the proposed approach, each using CQCC for feature extraction and SCAD for mining the structure of the data. These versions use SVM, GMM, XGBoost, and their combination, and are referred to as the local-SVM-based approach, local-GMM-based approach, local-XGBoost-based approach, and local-ensemble-learning-based approach, respectively. We compared the performance of the proposed approaches to three state-of-the-art approaches. The first was the approach reported in [4], which consists of extracting the CQCC feature and conveying it to a GMM-based classifier. The second approach, which is the most recent, was reported in [68]; it uses SCAD to cluster genuine utterances into G clusters, with the spoofed utterances placed into two S clusters. It then assigns an unknown instance to the class of its closest cluster (refer to Section 3.3). The third baseline approach, published in [46], was the best-performing method on the ASVspoof replay 2019 dataset. This approach uses CQCC and spectrogram features and conveys them to the SENet [47], Mean-Std ResNet [47], and Dilated ResNet [48] deep-learning models. The greedy fusion scheme described in [49] is then employed to explore the best system combination.
For this purpose, we considered the two available replay datasets: ASVspoof 2017 v2 [72] and ASVspoof 2019 [73]. The same datasets with the same training and testing sets were employed for all considered approaches. To evaluate the generalization capabilities of the proposed approach, both the training and testing EERs were compared, as reported in Table 10. The proposed approach based on ensemble learning outperformed all other considered systems on both datasets. Even without ensemble learning, the three other proposed approaches achieved smaller EERs than the state-of-the-art ones, except for the local-GMM-based approach, which offered the same performance as baseline approach 3 on the ASVspoof replay 2019 dataset. This result was achieved by dividing the classification problem into local subproblems to address the high variance of the utterances, and it was confirmed by the training and testing EER results: the difference between the training and testing EERs was reduced, which indicates that the generalization problem was addressed.
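Since the comparison rests on the EER and on the gap between training and testing EERs, a minimal sketch of how an EER can be estimated from detection scores may be helpful. It assumes that higher scores indicate genuine speech and takes the EER at the threshold where the false acceptance rate (FAR) and false rejection rate (FRR) are closest; the function name and all score values are hypothetical and are not taken from Table 10.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER by sweeping a decision threshold over all scores."""
    if not genuine_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    best_gap, eer = None, None
    for t in thresholds:
        # FAR: spoofed utterances accepted as genuine (score >= threshold).
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # FRR: genuine utterances rejected (score < threshold).
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores: the training set is separable, the test set is not.
train_eer = equal_error_rate([0.9, 0.8], [0.1, 0.2])
test_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
generalization_gap = test_eer - train_eer
print(train_eer, test_eer, generalization_gap)
```

A smaller gap between the two values corresponds to the reduced training/testing difference discussed above.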

Conclusions and Future Works
Spoofing detection approaches are crucial for protecting user data against voice spoofing attacks when ASV is used. These approaches amount to a classification problem in which audio utterances are categorized as genuine or spoofed. However, this task remains challenging due to the high variance of the utterances, which affects the model's generalization to unseen utterances.
In this paper, we devised a new replay countermeasure to address the high variance of these utterances. The countermeasure divides the challenging classification problem into a set of local subproblems by mining the hidden structure of the data. Ensemble learning is then used to combine the resulting submodels. Various features, clustering techniques, classifiers, and their combinations were investigated. The experiments showed that CA clustering can automatically learn the number of homogeneous partitions of the data. Moreover, the experimental results showed that CQCC audio features, together with the SVM classifier and the SCAD clustering technique, are the most suitable components for building the proposed approach. The resulting method outperformed state-of-the-art approaches. Furthermore, when combining the results of the three classifiers (SVM, GMM, and XGBoost), the proposed approach achieved even better results.
In future work, other audio features, classifiers, clustering techniques, and ensemble learning strategies could be investigated. Moreover, the performance of the proposed approaches on other types of voice spoofing could be explored.