Generalized Spoof Detection and Incremental Algorithm Recognition for Voice Spoofing

Guo, Jinlin; Zhao, Yancheng; Wang, Haoran

doi:10.3390/app13137773


by Jinlin Guo *, Yancheng Zhao and Haoran Wang

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(13), 7773; https://doi.org/10.3390/app13137773

Submission received: 11 May 2023 / Revised: 11 June 2023 / Accepted: 11 June 2023 / Published: 30 June 2023


Abstract

Highly deceptive deepfake technologies have caused much controversy, e.g., artificial intelligence-based software can automatically generate nude photos and deepfake images of anyone. This brings considerable threats to both individuals and society. In addition to video and image forgery, audio forgery poses many hazards but has not received sufficient attention. Furthermore, existing works have focused only on voice spoof detection, neglecting the identification of spoof algorithms. Recognizing the algorithm used to synthesize a spoofing voice is of great value for traceability. This study presents a system combining voice spoof detection and algorithm recognition. First, the generalizability of the spoof detection model is discussed from the perspective of the embedding space and decision boundaries, so that the model can cope with voice spoofing attacks generated by spoof algorithms that are not available in the training set. Second, this study presents a method for voice spoof algorithm recognition based on incremental learning, taking into account data flow scenarios in which new spoof algorithms keep appearing in reality. Our experimental results on the LA dataset of ASVspoof show that our system can improve the generalization of spoof detection and identify new voice spoof algorithms without catastrophic forgetting.

Keywords:

generalizability; voice spoof detection; incremental learning; spoof algorithm recognition

1. Introduction

In recent years, the development of deep learning and the emergence of deep neural networks have lowered the threshold for voice spoofs. This has led to spoofing attacks against automatic speaker verification, increasing security concerns, because voice spoofing poses a severe threat to basic applications of automatic speaker verification, such as phone unlocking and WeChat authentication. These attacks mainly include imitation (mimicry or twinning), replay (prerecorded audio), text-to-speech (converting text to spoken language), and voice conversion (converting voice from the source speaker to the target speaker) [1,2]. Synthetic speech attacks, including text-to-speech (TTS) and voice conversion (VC), pose an increasing threat to speaker verification systems owing to the rapid development of speech synthesis technologies [3,4].

It is essential to develop independent spoof detection modules to counteract the harm of speech attacks on speaker verification systems. The ASVspoof challenge [2,5,6] has provided datasets and metrics for anti-spoofing speaker verification research. Most traditional approaches rely on feature engineering, such as the Linear Frequency Cepstral Coefficient (LFCC), Constant Q Cepstral Coefficient (CQCC), and Mel Frequency Cepstral Coefficient (MFCC), which have yielded promising results in voice spoof detection. However, these handcrafted features generalize relatively poorly and struggle to keep pace with the rapid development of deep learning techniques. Zhang et al. [7] studied deep learning models for spoof detection and demonstrated that a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) could improve the robustness of the system. Gomez-Aranes et al. [8] used a lightweight convolutional gated RNN architecture to better capture long-term dependencies for spoof detection. Wu et al. [9] proposed a lightweight CNN system based on feature generation, outperforming other single systems in detecting synthetic attacks. Although these methods have shown excellent results in detecting existing spoof voices, they generalize poorly to voice attacks generated by spoof algorithms unavailable in the training set, because they assume that the training and test data have the same or similar distributions.

However, new spoofing methods continue to evolve, and the distribution of the new voice spoofing samples may be quite different from that of the voice spoofing data used for training. This difference can render the previously trained model ineffective for such spoofing samples. To improve the generalizability of the trained spoof detection model, a transfer learning method called domain adaptation can adapt the trained model to new spoofing voices. However, domain adaptation is limited when the distributions of the source and target domains have a large gap. We often have adequate data on real voices, whose distribution is not significantly affected by time. Therefore, we can use a one-class classification approach [10] to solve the distribution mismatch problem when classifying sounds. Real voices belong to a target class whose distribution can be fully learned. In contrast, spoof voices belong to a non-target class whose samples do not exist in the training set or are not statistically representative.

Moreover, most research has focused only on the authenticity of the voice while ignoring the synthesis algorithms behind it. This study focuses on both voice attack detection and voice spoof algorithm recognition. The real voice is first distinguished from the voice generated by the TTS and VC algorithms, and a voice judged to be spoofed is then further analyzed to identify the spoof algorithm that produced it. Voice spoof algorithm recognition can be considered a multi-class classification problem. As spoof algorithms are continuously proposed, the classifier is either retrained on an increasingly large training set or fine-tuned. The former solution is infeasible, as the data grows sequentially. The latter may encounter a well-known phenomenon called catastrophic forgetting [11], in which the knowledge learned during previous training is lost. Incremental learning [12] is considered a strategy for overcoming the problems of excessive storage and catastrophic forgetting while fully considering the data flow scenarios where new spoof algorithms continue to appear.

This study proposes a voice spoof detection and recognition system that includes two modules: a voice spoof detection module and a voice spoof algorithm recognition module. The first module focuses on the generalizability of spoof detection. In the second module, a method capable of classifying the voice generated by new spoof algorithms is proposed without degrading the recognition accuracy of previous spoof algorithms.

The contributions of this paper can be summarized as follows.

(1): The generalizability of the spoof detection model is discussed from the perspective of the embedding space and decision boundaries. The embedding space of real voices is compactly distributed, whereas that of the spoofing voices is more dispersed. A highly generalized decision boundary is learned to maintain them at a certain distance from each other.
(2): A method for voice spoof algorithm recognition based on incremental learning is presented, which can adapt to the feature distribution of the new algorithms for spoofing voices without having an excessive impact on the feature distribution of the old algorithms.
(3): A system combining spoof detection and algorithm recognition is first proposed, which facilitates the traceability of spoofing voices by implementing spoof algorithm recognition following detection. The experimental results demonstrate the effectiveness of the proposed system.

2. Related Work

2.1. Domain Adaptation

Domain adaptation is one of the most popular branches of transfer learning, which improves the performance of a model on a test dataset using the knowledge gained from training on the training dataset. There are two fundamental concepts in domain adaptation: the source and target domains. The source domain is rich in supervised information (labels). The target domain represents the domain in which the test set is located, typically without labels or containing only a few labels. The source and target domains often share the same task type but have different distributions. The current deep domain adaptation algorithms mainly focus on learning a domain-invariant space through shared feature extraction on the source and target data. To measure the difference in the distribution of features between the two domains, existing research has utilized different measures for domain adaptation, such as the maximum mean discrepancy (MMD), different orders of statistical moments of the distribution, adversarial learning, and conditional adversarial learning [13,14]. Mesay et al. [15] applied the same model to two image datasets with different distributions using a domain adaptation method and obtained good performance. Claudia et al. [16] used a domain adaptation method to classify images for which no ground truth was available by using different but related images.

2.2. One-Class Learning Method

One-class classification was developed for anomaly detection. It considers only the similarity between the input samples and the existing samples of the target class, whereas samples of abnormal types are uniformly excluded. The key idea of a one-class classification approach is to capture the target class distribution and set tight classification boundaries around it so that all non-target data fall outside the boundaries. Alegre et al. [17] demonstrated the potential of a one-class classification approach using a one-class support vector machine (OC-SVM) trained only with real voices. A feature space was learned using a one-class softmax loss function [18] to classify the local binary patterns of Cepstral voices without using forged audio samples.

2.3. Prototype Learning

A prototype represents the average or best example of a category and can provide a concise representation of all the examples in that category. Ref. [19] proposed learning vector quantization (LVQ), which saves storage space and improves the efficiency of k-nearest neighbor (KNN) operations. LVQ has been widely studied, with many variations. In most previous studies, prototypes were learned by optimizing customized objective functions [20]. In recent years, researchers have added prototype learning based on probabilistic models and neural networks to the classification process. In [21], the authors expressed input instances using K-dimensional vectors with a probabilistic mixture of components and parameterized K-type prototypes using probabilistic models. In [12], neural network predictions were made explainable by basing them on the similarity of the input to a set of learned prototypes. In this study, we not only optimized the classification prototypes but also focused on feature representations that are compact within classes and separable between classes. This makes incremental learning more discriminative and robust.

2.4. Incremental Learning

Incremental learning is often used in data flow scenarios where new classes continue to appear. In most cases, only a small number of classes are available initially, and new classes emerge over time. Studies on incremental learning have a long history in machine learning and artificial intelligence (AI). Some past works, such as [22,23,24,25], focused on continuously updating the training set with data obtained from new classes. These approaches are characterized by (i) being limited to a fixed feature representation for learning or (ii) continuously preserving data from new categories to retrain the model. However, their biggest drawback is that the number of parameters and the required computational resources increase rapidly. In addition, these approaches require more space to store the initial training set and the new class samples, which is impractical. This study focused on storing a small number of training instances and updating feature representations during incremental learning.

3. Voice Spoof Detection and Algorithm Recognition System

The framework of our system is shown in Figure 1.

To prevent voice spoof detection and spoof algorithm identification from interfering with each other, we divided them into two separate modules. First, the test voice was input into the voice spoof detection module to determine its authenticity. If the voice was judged to be real, the test was completed. If the voice was judged to be spoofed, it was input into the voice spoof algorithm recognition module, which identified the spoof algorithm.

The following section describes the methods used in these two modules. Section 3.1 describes the methods for detecting the authenticity of a voice, and Section 3.2 gives the recognition methods used for the voice spoof algorithm in incremental scenarios.
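The two-stage flow described above can be sketched in a few lines of Python. This is only an illustrative sketch: `analyze_voice`, `detect_spoof`, and `recognize_algorithm` are hypothetical names introduced here, and the two callables stand in for the trained modules of Sections 3.1 and 3.2.

```python
def analyze_voice(voice, detect_spoof, recognize_algorithm):
    """Run spoof detection first; only spoofed voices reach algorithm recognition.

    detect_spoof(voice) -> bool            (hypothetical stand-in for the detection module)
    recognize_algorithm(voice) -> label    (hypothetical stand-in for the recognition module)
    """
    if not detect_spoof(voice):
        # The voice is judged to be real, so the test ends here.
        return {"spoofed": False, "algorithm": None}
    # The voice is judged to be spoofed, so it is passed on to the second module.
    return {"spoofed": True, "algorithm": recognize_algorithm(voice)}
```

Keeping the two modules behind separate callables mirrors the design choice stated above: neither module needs to know how the other is trained.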

3.1. Generalized Voice Spoof Detection

3.1.1. Domain Adaptation Based on Decision Boundary Maximization

The domain-adaptive approach utilizes two types of networks: task-specific classifiers and feature mappers. To find the target samples near the decision boundary, two independent classifiers were introduced. The features were obtained from the mappers using both classifiers, and an attempt was made to classify the source samples correctly. This is shown in Figure 2a.

The different distributions of the source and target domains lead to differences in the classification results of the two classifiers for the target samples near the decision boundary, and they cannot be correctly classified for these samples. Adversarial learning methods were introduced for the samples that could not be correctly classified near the decision boundary. First, a task-specific classifier was used to maximize the decision boundary to increase the difference in the results between the two classifiers, as shown in Figure 2b. Then, a mapper was used to produce a better mapping effect to reduce the difference, as shown in Figure 2c. Thus, by considering the relationship between the target samples and task-specific decision boundaries, it is possible to make the model learned on the training set applicable to different distributions of faked audio in the test set. The specific procedure is as follows.

Phase 1: As shown in Figure 2a, shadows are generated, allowing different feature representations to be learned by different classifiers. The goal was to maximize the shaded area.

Phase 2: As shown in Figure 2b, the discrepancy between the two classifiers is maximized, so the shaded area is larger at this point. The next goal was therefore to reduce the divergence between the two classifiers by extracting better features.

Phase 3: The divergence is minimized, as shown in Figure 2c. At this point, the shadows are almost nonexistent, but the problem is that there is still some compactness in the decision boundary. These features might not be robust, so alternating optimization with the previous stage continued.

Phase 4: The final goal of the alternate optimization is shown in Figure 2d. At this point, shadows do not exist, and the decision boundary is more robust.

This study constructed the two classifiers by applying dropout regularization to a single classifier network [26] and measured the difference between them using the symmetric Kullback–Leibler (KL) divergence. The procedure is as follows: the features $G(x_{t})$ obtained from the feature mapper $G$ are fed to the classifier network $C$ twice. Different nodes are dropped each time, which yields two different output vectors, denoted as $C_{1}(G(x_{t}))$ and $C_{2}(G(x_{t}))$, respectively. The corresponding posterior probabilities are denoted as $p_{1}(y|x)$ and $p_{2}(y|x)$ and abbreviated as $p_{1}$ and $p_{2}$ below. The discriminator then attempts to increase the difference between the predictions of $C_{1}$ and $C_{2}$. This difference corresponds to the discriminator's sensitivity to noise and is measured using the symmetric KL divergence $d(p_{1}, p_{2})$ between the two probability outputs, calculated as

d (p_{1}, p_{2}) = \frac{1}{2} (D_{kl} (p_{1} \| p_{2}) + D_{kl} (p_{2} \| p_{1}))

(1)

where $D_{kl}(p \| q)$ denotes the KL divergence between $p$ and $q$.
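As a small numerical illustration of Equation (1), the following Python sketch computes the symmetric KL measure for two discrete posteriors. The function names and the example probabilities are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_kl(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p1, p2):
    """Equation (1): the average of the two directed KL divergences."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


# Hypothetical posteriors from two dropout passes over the same target sample.
p1 = [0.9, 0.1]
p2 = [0.6, 0.4]
d = symmetric_kl(p1, p2)
```

The measure is zero when the two passes agree exactly and grows as their predictions diverge, which is why it can serve as the quantity the classifiers maximize and the mapper minimizes.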

However, the decision boundaries obtained from such adversarial training methods were still set for the training data. Therefore, the decision boundary set for the voice spoofing samples in the embedding space might be too compact. This led to difficulties for the model in defending against unseen voice spoofing attacks. Therefore, we introduced a one-class learning method. It set decision boundaries for real data and forged data separately, which improved the model’s defense against unseen voice spoofing attacks.

3.1.2. One-Class Learning

During model inference, the voice spoof detection task may encounter many voice samples of unknown attack types, whereas the types of voice attacks included in the training set are often limited. This leads to a severe mismatch between the feature distributions of the training and test sets for the spoofing samples. If voice spoof detection is viewed purely as a binary classification problem, the model tends to suffer degraded performance at inference time when faced with many new kinds of samples. This scenario is well suited to the idea of one-class classification.

The traditional voice spoof detection task uses a binary classification loss function expressed as

L = - \frac{1}{M} \sum_{i = 1}^{M} \log \frac{e^{ω_{{label}_{i}}^{T} x_{i}}}{e^{ω_{{label}_{i}}^{T} x_{i}} + e^{ω_{1 - {label}_{i}}^{T} x_{i}}}

(2)

where $x_{i} \in ℝ^{D}$ is the embedding vector of the $i$th voice sample, ${label}_{i} \in \{0, 1\}$ is the corresponding label, and $M$ is the number of samples. AM-Softmax introduces an angular margin into the traditional binary loss function. The distances between the data within each class are minimized while the distance between the target and non-target classes is maximized, which makes the decision boundary more compact. AM-Softmax is expressed as follows:

L = - \frac{1}{M} \sum_{i = 1}^{M} \log \frac{e^{ξ ({\hat{ω}}_{{label}_{i}}^{T} {\hat{x}}_{i} - m)}}{e^{ξ ({\hat{ω}}_{{label}_{i}}^{T} {\hat{x}}_{i} - m)} + e^{ξ {\hat{ω}}_{1 - {label}_{i}}^{T} {\hat{x}}_{i}}} = \frac{1}{M} \sum_{i = 1}^{M} \log (1 + e^{ξ (m - ({\hat{ω}}_{{label}_{i}} - {\hat{ω}}_{1 - {label}_{i}})^{T} {\hat{x}}_{i})})

(3)

where $ξ$ denotes the scaling factor, $m$ is the cosine similarity margin, and $\hat{ω}$ and $\hat{x}$ are the normalized $ω$ and $x$, respectively. It can be seen from the equation that the embedding vectors of the target and non-target classes tend to converge in two opposite directions, i.e., $ω_{0} - ω_{1}$ and $ω_{1} - ω_{0}$, respectively, where $ω_{0}$ is the target class represented by the real voice and $ω_{1}$ is the non-target class represented by the spoofing voice. For AM-Softmax, the embedding features of both the target and non-target classes are given the same angular margin $m$.
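The equality between the two forms of Equation (3) can be checked numerically for a single sample. The sketch below is illustrative only: the helper names and the example vectors are assumptions, and the scale and margin values are the ones reported later for AM-Softmax.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def am_softmax_fraction(w_label, w_other, x, scale, margin):
    """First form of Equation (3) for one sample: -log of the margin softmax ratio."""
    s_label = scale * (dot(w_label, x) - margin)
    s_other = scale * dot(w_other, x)
    return -math.log(math.exp(s_label) / (math.exp(s_label) + math.exp(s_other)))


def am_softmax_softplus(w_label, w_other, x, scale, margin):
    """Second form: log(1 + exp(scale * (margin - (w_label - w_other)^T x)))."""
    diff = [a - b for a, b in zip(w_label, w_other)]
    return math.log1p(math.exp(scale * (margin - dot(diff, x))))


# Hypothetical unit-norm class weights and embedding.
w0 = normalize([1.0, 0.2])
w1 = normalize([-0.5, 1.0])
x = normalize([0.8, 0.6])
loss_a = am_softmax_fraction(w0, w1, x, scale=20.0, margin=0.9)
loss_b = am_softmax_softplus(w0, w1, x, scale=20.0, margin=0.9)
```

The second form makes the role of the direction $\hat{ω}_{{label}_{i}} - \hat{ω}_{1 - {label}_{i}}$ explicit: the loss falls as the embedding aligns with that difference vector.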

In a realistic scenario, the feature distribution of the non-target class (the spoofing voice) differs significantly from that of the target class, so training the same decision boundary for both often results in overfitting the model to the known attack types in the training set. Therefore, we introduced different angular margins for the real and spoofing samples, so that different decision boundaries are trained to better identify real voice samples and isolate voice spoofing samples. The designed loss function, OC-Softmax, is expressed as follows:

L_{oc} = \frac{1}{M} \sum_{i = 1}^{M} \log (1 + e^{ξ (m_{{label}_{i}} - {\hat{ω}}_{0}^{T} {\hat{x}}_{i}) {(- 1)}^{{label}_{i}}})

(4)

In OC-Softmax, only one weight vector $ω_{0}$ is used, representing the optimization direction of the target class embedding vector. Both $ω_{0}$ and $x_{i}$ are normalized, as in AM-Softmax. We then introduced two different angular margins $m_{0}, m_{1} \in [-1, 1]$ for the classification of real voices and spoofing voices, respectively, where $m_{0} > m_{1}$.

As shown in Figure 3, there is only one weight vector $ω_{0}$ in OC-Softmax, indicating the convergence direction of the target class (the real voice). The angle between $ω_{0}$ and the voice feature vector $x_{i}$ is denoted by $θ_{i}$. When ${label}_{i} = 0$, the sample belongs to the target class, and the decision boundary is constrained by setting $m_{0}$ so that $θ_{i}$ has to be smaller than $\arccos m_{0}$. When the value of $\arccos m_{0}$ is small, the target class is concentrated around the weight vector $ω_{0}$, thereby making the feature distribution of the real voice samples more compact. When ${label}_{i} = 1$, $m_{1}$ is set so that $θ_{i}$ is greater than $\arccos m_{1}$. When the value of $\arccos m_{1}$ is large, the data of the non-target class are pushed away from the direction of $ω_{0}$. Thus, the purpose of training different decision boundaries for the target class and non-target class samples is achieved.
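A minimal Python sketch of Equation (4) is given below, together with the angular thresholds implied by the two margins. It assumes that the cosine similarity between the normalized $ω_{0}$ and each embedding has already been computed; the function names are hypothetical, and the default values are the hyperparameters reported in Section 4.2.1.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def oc_softmax_loss(cosines, labels, scale=20.0, m0=0.9, m1=0.2):
    """Mean OC-Softmax loss of Equation (4).

    cosines[i]: cosine similarity between the normalized w0 and embedding x_i
    labels[i]:  0 for a real voice, 1 for a spoofed voice
    """
    total = 0.0
    for cos_theta, label in zip(cosines, labels):
        margin = m0 if label == 0 else m1
        sign = 1.0 if label == 0 else -1.0  # the factor (-1)^label
        total += softplus(scale * (margin - cos_theta) * sign)
    return total / len(cosines)


def boundary_angle_degrees(margin):
    """Angle arccos(m) that bounds theta_i for the given margin."""
    return math.degrees(math.acos(margin))
```

With these margins, real voices are pulled to within roughly 26 degrees of $ω_{0}$, whereas spoofed voices are pushed beyond roughly 78 degrees, which leaves a wide band between the two boundaries for unseen attacks.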

3.2. Incremental Spoof Algorithms Recognition

Spoof algorithm identification is essentially a multi-classification problem. The difficulty is that the model must be updated, as spoof algorithms are continuously proposed in reality, which often faces the problem of over-storage or catastrophic forgetting. For this reason, we construct a sparser embedding space using a prototype so that the appearance of new classes does not have an outsized impact on the old ones. This allows the system to adapt to incremental spoofing algorithm recognition.

3.2.1. Prototype Learning

The purpose of voice spoof algorithm recognition is to extract a feature representation of the voice and, in turn, to determine the spoof algorithm used to generate it. The objective of this task is expressed as follows:

y = f (x, θ)

(5)

where $θ$ denotes the model parameters, and $y \in \{1, 2, \dots, C\}$ denotes the spoof algorithm class label. Unlike the traditional ResNet, which uses a softmax layer to perform linear classification on the learned features, we learn and maintain multiple prototypes of the features of each class. Classification is performed using prototype matching, as shown in Figure 4. The prototypes are represented as $m_{i, j}$, where $i \in \{1, 2, \dots, C\}$ denotes the index of the class, and $j \in \{1, 2, \dots, K\}$ denotes the index of the prototype in each class. Here, we assume that each class contains the same number $K$ of prototypes. The ResNet feature extractor $f(x, θ)$ and the prototypes $\{m_{i, j}\}$ are jointly trained from the data. In the classification phase, we classify an object by prototype matching, i.e., we find its nearest prototype based on the Euclidean distance and assign the class of that prototype to the object. It is expressed as

x \in \text{class } y, \quad y = \arg \max_{i = 1}^{C} g_{i} (x)

(6)

where $g_{i}(x)$ is the discriminant function for class $i$:

g_{i} (x) = - \min_{j = 1}^{K} ∥ f (x; θ) - m_{i, j} ∥_{2}^{2}

(7)
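Prototype matching as defined by Equations (6) and (7) can be sketched as follows. The prototypes and the feature vector are made-up two-dimensional values, and the dictionary layout is an assumption chosen for readability; in the paper, the features come from the ResNet extractor.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def discriminant(feature, class_prototypes):
    """Equation (7): negative squared distance to the closest prototype of one class."""
    return -min(squared_distance(feature, m) for m in class_prototypes)


def classify(feature, prototypes):
    """Equation (6): return the class whose discriminant value is largest.

    prototypes maps a class label to its list of K prototype vectors.
    """
    return max(prototypes, key=lambda cls: discriminant(feature, prototypes[cls]))


# Hypothetical prototypes: two classes with K = 2 prototypes each.
prototypes = {
    "A01": [[0.0, 0.0], [0.5, 0.0]],
    "A02": [[3.0, 3.0], [3.5, 3.0]],
}
predicted = classify([0.4, 0.1], prototypes)
```

Because the decision depends only on distances to stored prototypes, adding a new class amounts to adding new entries to the dictionary rather than resizing a softmax layer.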

The probability that a sample belongs to prototype $m_{i, j}$ is measured by the distance between the sample and the prototype and is expressed as

p (x \in m_{i, j} | x) = \frac{e^{- d (f (x), m_{i, j})}}{\sum_{k = 1}^{C} \sum_{l = 1}^{K} e^{- d (f (x), m_{k, l})}}

(8)

where $d(f(x), m_{i, j}) = ∥ f(x; θ) - m_{i, j} ∥_{2}^{2}$. Then, the objective function can be expressed as

\mathrm{loss}_{1} = - \log (\sum_{j = 1}^{K} p (x \in m_{i, j} | x))

(9)

with $i$ taken as the true class of $x$.

Direct minimization of the classification loss may lead to overfitting. To avoid this, a prototype loss is added as a regularization term to improve the model's generalization ability. The prototype loss pulls the feature of the input $x$ toward the nearest prototype (subclass center) of the class to which $x$ belongs. The resulting decision boundary between two adjacent classes lies where the distances to their nearest subclass centers are equal. The prototype loss can be expressed as

\mathrm{loss}_{2} ((x, y); θ, M) = ∥ f (x) - m_{y, j} ∥_{2}^{2}

(10)

where $m_{y, j}$ is the prototype closest to $f(x)$ in the corresponding class $y$. A relatively sparse feature space was obtained, which later facilitated incremental learning.
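The two prototype-based losses of Equations (8)–(10) can be illustrated for a single sample as follows. This is a sketch under stated assumptions: the function names, the dictionary layout, and the example values are hypothetical, and the softmax over negative distances is shifted by its maximum purely for numerical stability.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_based_loss(feature, label, prototypes):
    """loss_1 of Equation (9), built from the probabilities of Equation (8)."""
    neg_dist = {
        (cls, j): -sq_dist(feature, m)
        for cls, protos in prototypes.items()
        for j, m in enumerate(protos)
    }
    shift = max(neg_dist.values())  # subtracting the maximum leaves the ratio unchanged
    weights = {key: math.exp(v - shift) for key, v in neg_dist.items()}
    p_label = sum(w for (cls, _), w in weights.items() if cls == label) / sum(weights.values())
    return -math.log(p_label)


def prototype_loss(feature, label, prototypes):
    """loss_2 of Equation (10): squared distance to the nearest prototype of the true class."""
    return min(sq_dist(feature, m) for m in prototypes[label])


# Hypothetical prototypes and feature vector.
example_prototypes = {
    "A01": [[0.0, 0.0], [0.5, 0.0]],
    "A02": [[3.0, 3.0], [3.5, 3.0]],
}
example_feature = [0.4, 0.1]
l1 = distance_based_loss(example_feature, "A01", example_prototypes)
l2 = prototype_loss(example_feature, "A01", example_prototypes)
```

The first loss separates classes from each other, while the second one tightens each class around its own prototypes; the two are later combined with the weights $w_{1}$ and $w_{2}$.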

3.2.2. Incremental Learning Module

Voice spoof algorithms are constantly evolving, and the dataset is updated continually over time. Therefore, the model cannot initially include data from all voice spoof algorithms. The incremental learning module enables the model to continuously learn new voice spoof algorithms through prototypes and example sets, which avoids retraining the model on all the data and saves a significant amount of training time.

As shown in Figure 5, the model learns the prototypes at the $(t - 1)$th iteration stage and saves a small number of examples for each spoof algorithm class. At the $t$th iteration, the model adds prototypes for $n$ new classes to the prototypes of the $(t - 1)$th iteration, where $n$ is the number of voice spoof algorithm types in the newly added dataset and each new class contains the same number $K$ of prototypes. The model is then trained on the new voice spoofing dataset and the example set.
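The bookkeeping of one incremental step can be sketched as follows. The text does not specify how new prototypes are initialized or how stored examples are selected, so the small random initialization and the "keep the first samples" rule below are explicit assumptions, as are all function and parameter names.

```python
import random


def start_increment(prototypes, exemplars, new_data, k, dim, seed=0):
    """Prepare iteration t from the state left by iteration t-1.

    prototypes: dict class -> list of K prototype vectors (old classes are left untouched here)
    exemplars:  dict class -> small list of stored samples of the old classes
    new_data:   dict class -> list of samples for the n new classes
    Returns the list of (sample, label) pairs used for training at iteration t.
    """
    rng = random.Random(seed)
    for cls in new_data:
        # Assumption: new prototypes start as small random vectors and are refined by training.
        prototypes[cls] = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(k)]
    training_set = [(x, cls) for cls, samples in new_data.items() for x in samples]
    training_set += [(x, cls) for cls, samples in exemplars.items() for x in samples]
    return training_set


def update_exemplars(exemplars, new_data, per_class):
    """After training, store at most per_class samples of each new class.

    Assumption: the first per_class samples are kept; the selection rule is not given in the text.
    """
    for cls, samples in new_data.items():
        exemplars[cls] = samples[:per_class]
```

Only the small example set and the prototypes are carried from one iteration to the next, which is what keeps storage bounded as new spoof algorithms arrive.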

The amount of data stored for each class in the example set is much smaller than that in the new voice spoofing training set. Therefore, the model suffers from a data imbalance during training. For this reason, we assigned a different weight to each category when calculating the loss, based on the number of examples saved for each old category and the number of voices generated by each spoof algorithm in the new dataset. This reduces the impact of data imbalance on the recognition results. The weighted loss is expressed as follows:

\mathrm{loss} (x, y) = \mathrm{weight} [y] (- x [y] + \log (\sum_{j} \exp (x [j])))

(11)

where $y \in \{1, 2, \dots, C\}$ denotes the classification label, $x[j]$ denotes the score output by the model for class $j$, and weight denotes the loss weight of each category.
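Equation (11) can be written out directly for one sample, as in the sketch below. The weighting rule shown (inverse class frequency, scaled so that the largest class has weight 1) is a hypothetical choice for illustration; the text only states that the weights depend on the numbers of stored and new samples.

```python
import math


def weighted_cross_entropy(scores, label, weight):
    """Equation (11) for one sample: weight[y] * (-x[y] + log(sum_j exp(x[j])))."""
    shift = max(scores)  # log-sum-exp shift for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return weight[label] * (-scores[label] + log_sum_exp)


def inverse_frequency_weights(counts):
    """Hypothetical rule: weight each class by (largest class size) / (its own size)."""
    largest = max(counts)
    return [largest / c for c in counts]


# Example: an old class with 200 stored examples and a new class with 2000 samples.
weights = inverse_frequency_weights([200, 2000])
loss_old = weighted_cross_entropy([0.0, 0.0], 0, weights)
loss_new = weighted_cross_entropy([0.0, 0.0], 1, weights)
```

Under this rule, an error on a sample of the sparsely stored old class costs ten times as much as the same error on the new class, which counteracts the imbalance between the example set and the new data.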

Thus, the total loss function of the incremental learning module is denoted as

\mathrm{loss}_{total} = w_{1} \cdot \mathrm{loss}_{1} + w_{2} \cdot \mathrm{loss}_{2} + \mathrm{loss}_{3}

(12)

where $w_{1}$ and $w_{2}$ are the weight parameters, set to 0.1 and 0.01, respectively, in our experiments. $\mathrm{loss}_{3}$ is the cross-entropy loss function used to train the classification ability of the neural network, denoted as

\mathrm{loss}_{3} = - \frac{1}{S} \sum_{i = 1}^{S} [y_{i} \cdot \log (\mathrm{softmax} (f_{i}))]

(13)

where $S$ is the batch size, and $f_{i}$ is the feature of the $i$th sample in the batch.

4. Experimental Methods

4.1. Dataset

The ASVspoof 2019 challenge provides a standard database for anti-spoofing [27]. The LA subset of the provided dataset includes a real voice and different types of TTS and VC spoofing attacks. The training and development sets share the same six attacks (A01–A06), including four TTS and two VC algorithms. In the evaluation set, there are 11 unknown attacks (A07–A15, A17, and A18), including combinations of different TTS and VC attacks. The evaluation set also includes two attacks (A16 and A19) that use the same algorithms as the two attacks in the training set (A04 and A06) but are trained using different data.

When training the voice spoof detection model, we used the training, development, and evaluation sets of the LA subset in ASVspoof 2019 as the training, validation, and test sets for our experiments.

We re-partitioned the LA subset when training the voice spoof algorithm recognition model. Because A04 and A16 use the same algorithm and A06 and A19 use the same algorithm, we grouped A04 and A16 into class A04 and A06 and A19 into class A06. Subsequently, all classes A1 to A17 were divided into a training set, validation set, and evaluation set at a ratio of 3:1:1.

4.2. Training Details

A 60-dimensional LFCC was extracted from the voice with a frame size of 320 ms and a hop size of 160 ms. Each voice segment was fixed to a length of 4 s, and segments shorter than 4 s were padded by repetition so that batches could be formed.
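The fixed-length step can be sketched as below. The function name is hypothetical, and cropping from the start of longer inputs is an assumption, since the text does not say which 4 s portion is kept; `target_len` is expressed in samples or frames, whichever unit the caller uses.

```python
def repeat_pad_or_crop(samples, target_len):
    """Return exactly target_len items: crop long inputs, tile short ones by repetition."""
    if not samples:
        raise ValueError("cannot pad an empty sequence")
    if len(samples) >= target_len:
        # Assumption: keep the beginning of the segment.
        return samples[:target_len]
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]
```

For example, a three-item input padded to length 7 repeats as 1, 2, 3, 1, 2, 3, 1, so no artificial silence is introduced into short utterances.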

4.2.1. Training Parameters for Voice Spoof Detection

Four sets of voice spoof detection experiments were conducted. The experiments were all based on the light convolutional neural network (LCNN) architecture, which takes the extracted LFCC features as input and outputs a confidence score to characterize the classification result. For the hyperparameters in the loss functions, we set the scaling factor α = 20 (denoted as ξ in Equations (3) and (4)) and m = 0.9 for AM-Softmax, and α = 20, $m_{0}$ = 0.9, and $m_{1}$ = 0.2 for OC-Softmax. We used the Adam optimizer [28] with $β_{1}$ = 0.9 and $β_{2}$ = 0.999 to update the weights in the model, and a stochastic gradient descent (SGD) optimizer [29] for the parameters in the loss function. The batch size was set to 16, the learning rate to 0.0001, and the number of epochs to 50. The early-stopping patience was set to 5.

4.2.2. Training Parameters for Spoof Algorithm Recognition

For voice spoof algorithm recognition, we used ResNet-44 for voice feature extraction. For each category, 200 samples were stored, and each training step consisted of 20 epochs. The learning rate started at 0.001 and was divided by 10 after 15 epochs. The model parameters were optimized using an SGD optimizer, and the batch size was set to 64. In the incremental learning experiments, we used A01–A09 as the base categories. Subsequently, two new classes of spoof methods were added at each incremental step. The weight size in the SGD optimizer was set according to the ratio of the stored instance size to the size of the new class dataset.

4.2.3. Evaluation Metrics

To evaluate the performance of the voice spoofing detection module, we recorded the output score of the voice spoofing detection module, called the countermeasure (CM) score. This indicates the similarity between a given voice and a real voice. The equal error rate (EER) was calculated by setting a threshold on the CM decision score such that the false alarm rate was equal to the miss rate. The calculations were as follows:

P_{fa} (θ) = \frac{\# \{\text{spoof voices with score} > θ\}}{\# \{\text{total spoof voices}\}}

(14)

P_{miss} (θ) = \frac{\# \{\text{human voices with score} \leq θ\}}{\# \{\text{total human voices}\}}

(15)

where $P_{fa}(θ)$ and $P_{miss}(θ)$ are monotonically decreasing and increasing functions of $θ$, respectively. EER corresponds to the threshold $θ_{EER}$ at which the two detection error rates are equal, i.e., EER = $P_{fa}(θ_{EER})$ = $P_{miss}(θ_{EER})$. The lower the EER, the better the voice spoofing detection module detects spoofing attacks.
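The EER definition can be illustrated with a small threshold scan over observed CM scores. This is a simplified sketch with hypothetical function names: it takes the threshold at which the two error rates are closest and averages them, whereas evaluation toolkits typically interpolate between thresholds.

```python
def error_rates(bona_fide_scores, spoof_scores, threshold):
    """Equations (14) and (15) evaluated at one threshold."""
    p_fa = sum(s > threshold for s in spoof_scores) / len(spoof_scores)
    p_miss = sum(s <= threshold for s in bona_fide_scores) / len(bona_fide_scores)
    return p_fa, p_miss


def equal_error_rate(bona_fide_scores, spoof_scores):
    """Scan observed scores as thresholds; return (EER estimate, threshold)."""
    best = None
    for theta in sorted(set(bona_fide_scores) | set(spoof_scores)):
        p_fa, p_miss = error_rates(bona_fide_scores, spoof_scores, theta)
        gap = abs(p_fa - p_miss)
        if best is None or gap < best[0]:
            best = (gap, (p_fa + p_miss) / 2, theta)
    return best[1], best[2]


# Made-up CM scores: higher means more similar to a real voice.
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer, theta_eer = equal_error_rate(bona_fide, spoof)
```

In this toy example, one of four spoofed clips scores above the chosen threshold and one of four real clips scores at or below it, so both error rates coincide.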

The tandem detection cost function (t-DCF) [30] is a new evaluation metric adopted for the ASVspoof 2019 challenge. Whereas EER evaluates only the performance of the voice spoofing detection module itself, t-DCF assesses the influence of the voice spoofing detection module on the reliability of an ASV system. The ASV system was fixed to compare the different voice spoofing detection modules. The lower the t-DCF, the better the reliability of the ASV system.

For the voice spoof algorithm recognition module, we recorded the accuracy of each iteration and the recognition accuracy of each type of algorithm for the last iteration. The higher the accuracy rate, the better the recognition performance.

4.3. Experimental Results

4.3.1. Results of Spoof Detection

To demonstrate the effectiveness of the one-class learning and domain-adaptive methods, we compared our proposed OC-Softmax and domain-adaptive methods with the traditional binary classification loss function under the same input features and model settings. They were also compared with the EER and min t-DCF metrics of the baseline model LFCC-GMM, officially provided by the ASVspoof 2019 competition, as shown in Table 1. In Table 1, the OC-Softmax and domain-adaptive methods obtained the best results, with EERs of 2.530% and 2.638%, respectively. Under the min t-DCF metric, the OC-Softmax method obtained the best result of 0.0682. This indicates that, owing to the strong generalization ability of the one-class loss function, the model could still distinguish real voices from spoofing voices of unknown attack types in the test set during inference.

To further analyze the performance of each method, we obtained the distribution of the CM scores in the four sets of experiments, as shown in Figure 6. Figure 6a shows a considerable overlap between the bona fide and spoof distributions, which indicates that the AM-Softmax-trained model generalizes poorly and has a high misclassification rate. This overlap is almost invisible in Figure 6b–d. However, as can be seen from the horizontal axes, the difference between the mean CM scores of the bona fide and spoof samples is approximately 1 in Figure 6b, approximately 1.5 in Figure 6c, and approximately 2 in Figure 6d. The larger this difference, the stronger the model’s ability to distinguish between bona fide and spoofed voices and the higher its generalizability. We therefore rank the methods by generalization ability for voice spoof detection as follows: OC-Softmax > domain adaptation > Sigmoid > AM-Softmax.
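The two quantities read off the histograms, the gap between mean CM scores and the overlap of the two distributions, can be computed as in the following sketch. The function names, the bin count, and the toy scores are assumptions for illustration and are not values from Figure 6.

```python
from statistics import mean


def mean_score_gap(bona_fide_scores, spoof_scores):
    """Difference between the mean CM scores of the two populations."""
    return mean(bona_fide_scores) - mean(spoof_scores)


def histogram_overlap(bona_fide_scores, spoof_scores, bins=20):
    """Shared area of the two normalized histograms (0 = disjoint, 1 = identical)."""
    all_scores = list(bona_fide_scores) + list(spoof_scores)
    low, high = min(all_scores), max(all_scores)
    width = (high - low) / bins or 1.0  # guard against all scores being equal

    def normalized_histogram(scores):
        counts = [0] * bins
        for value in scores:
            counts[min(int((value - low) / width), bins - 1)] += 1
        return [count / len(scores) for count in counts]

    pairs = zip(normalized_histogram(bona_fide_scores), normalized_histogram(spoof_scores))
    return sum(min(p, q) for p, q in pairs)


# Toy scores invented for illustration only.
bona_fide = [0.8, 0.9, 1.0]
spoof = [-1.0, -0.9, -0.8]
gap = mean_score_gap(bona_fide, spoof)
overlap = histogram_overlap(bona_fide, spoof)
print(f"gap = {gap:.2f}, overlap = {overlap:.2f}")
```

The gap is only comparable across models whose scores live on the same scale, which is why the text reads it from the horizontal axes of the four panels.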

4.3.2. Results of Spoof Algorithm Recognition

The results of incremental learning using iCaRL and the prototype method on the ASVspoof 2019 LA dataset are summarized in Table 2. Comparing the second and third rows of the table, we see that the weights effectively mitigate the adverse effects of data imbalance and improve the model’s classification performance. Both the iCaRL and prototype methods in Table 2 use the same ResNet-44 to extract features from the LFCC of the voices, and these features are then fed into the classifier. Comparing the first and third rows in Table 2, we see that the weighted prototype method outperformed iCaRL from iteration 2 onward (e.g., 79.89% versus 76.58% after the last iteration), although iCaRL was slightly ahead at iterations 0 and 1.
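To make the role of the weights concrete, the sketch below computes inverse-frequency class weights, one common way to counter class imbalance. This is an assumption made for illustration: the passage above does not specify the weighting formula, so this should not be read as the scheme used in the experiments. The label names and counts are also invented.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1, so each
    class contributes equally to a weighted loss in aggregate.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical imbalance: few stored samples of an old class, many of a new one.
labels = ["old_class"] * 2 + ["new_class"] * 8
weights = inverse_frequency_weights(labels)
print(weights)
```

In an incremental setting the imbalance typically arises because only a small number of samples from earlier classes are retained, while each new class arrives with its full training data.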

We visualized the high-dimensional features of the prototype-trained model using t-distributed stochastic neighbor embedding (t-SNE) plots, as shown in Figure 7. Figure 7a shows that the features of the various types of voice spoofing samples form compact clusters within each class, with large spacing between classes. This makes the model highly generalizable for forgery algorithm recognition. From Figure 7b, we can see that the features of the untrained voice spoofing samples are primarily distributed in low-density regions and are separated from the features of the trained classes. This indicates that, after new classes are added, the old classes can still be classified using their prototypes, and the appearance of new classes does not have an excessive impact on the old classes. Therefore, the prototype method is suitable for incremental voice spoof algorithm recognition.
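The argument that old classes remain classifiable after new classes are added can be sketched with a nearest-prototype classifier. The sketch assumes Euclidean distance, two-dimensional toy features, and hand-picked prototype vectors. In the actual system the features come from the trained network and the prototypes are learned, so every name and number below is hypothetical.

```python
import math


def nearest_prototype(feature, prototypes):
    """Return the label whose closest prototype is nearest to the feature.

    `prototypes` maps each class label to a list of prototype vectors, so a
    class may own several prototypes.
    """
    best_label, best_distance = None, math.inf
    for label, vectors in prototypes.items():
        for vector in vectors:
            distance = math.dist(feature, vector)
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label


# Prototypes for two already learned classes (toy values).
prototypes = {"A1": [[0.0, 0.0], [0.0, 1.0]], "A2": [[5.0, 5.0]]}
old_sample = [0.2, 0.1]
before = nearest_prototype(old_sample, prototypes)

# A new class arrives; its prototype sits in a previously empty region.
prototypes["A10"] = [[10.0, 0.0]]
after = nearest_prototype(old_sample, prototypes)
new_prediction = nearest_prototype([9.0, 0.5], prototypes)
print(before, after, new_prediction)
```

Because the old prototypes are untouched when a class is added, predictions for old samples change only if a new prototype lands closer to them than their own, which is unlikely when new-class features fall in low-density regions.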

After the last training iteration, the accuracy of the model in identifying the various spoof algorithms in the test set is presented in Table 3. The model recognizes most forgery algorithms well (accuracy higher than 80%), whereas its classification performance for attacks A4, A7, and A10 is relatively poor (49.56%, 62.63%, and 55.20%, respectively). The confusion matrix in Figure 8 shows that A4, A7, and A10 are frequently misclassified as one another, which we believe is because the spoofed voices generated by these three algorithms have very similar characteristics. In addition, incremental learning operates in scenarios where data are scarce, which contributes to the model’s poor classification performance for these three attacks.
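A confusion matrix of the kind discussed above, together with a listing of the most frequent confusions, can be produced as follows. The helper names and the toy predictions are assumptions for illustration. The counts are invented and do not reproduce Figure 8.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def top_confusions(matrix, classes, k=3):
    """Return up to k (count, true class, predicted class) off-diagonal entries."""
    errors = [
        (matrix[i][j], classes[i], classes[j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(errors, reverse=True)[:k]


# Toy labels invented for illustration only.
classes = ["A4", "A7", "A10"]
true_labels = ["A4", "A4", "A4", "A7", "A7", "A10", "A10"]
predicted_labels = ["A4", "A7", "A10", "A7", "A4", "A10", "A4"]
matrix = confusion_matrix(true_labels, predicted_labels, classes)
print(matrix)
print(top_confusions(matrix, classes))
```

Large symmetric off-diagonal counts between a pair of classes are what "misclassified as one another" looks like in such a matrix.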

5. Conclusions

This study designed a system for voice spoofing detection and spoofing algorithm recognition. The system can detect the authenticity of a voice and recognize voice spoof algorithms under continuous data streams, a capability that other voice spoofing detection systems do not offer. This helps prevent synthetic voice spoofing attacks, e.g., phone scams, and thus protects the privacy and security of individuals in society. The system is divided into a voice spoof detection module and a spoof algorithm recognition module. The voice spoof detection module uses a one-class loss function to learn a compact distribution of real voices, whereas spoofed voices lie outside this distribution in the feature space, separated from it by a certain angle. This makes the system more robust for voice spoof detection than binary classification methods. The voice spoof algorithm recognition module learns multiple prototypes in the feature space for each spoof algorithm type, which are then used for classification in incremental scenarios. The experimental results show that our voice spoof detection module achieves a lower EER and min t-DCF than the baseline, traditional loss function, and domain adaptation methods, and that the voice spoof algorithm recognition module outperforms iCaRL in the later incremental iterations. Since our system is oriented toward practical applications, it may consume excessive system resources on small devices. In future work, we will therefore make the system more lightweight for different usage scenarios while preserving its detection performance.

Author Contributions

Conceptualization, J.G.; Methodology, Y.Z.; Software, H.W.; Validation, J.G.; Formal analysis, J.G.; Investigation, J.G.; Resources, H.W.; Data curation, Y.Z.; Writing—original draft, Y.Z.; Writing—review & editing, H.W.; Visualization, Y.Z.; Supervision, H.W.; Project administration, J.G.; Funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China: 61806218.

Data Availability Statement

The publicly archived dataset used in this study is available at https://datashare.ed.ac.uk/handle/10283/3336.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153.
2. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.H.; Lee, K.A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. Proc. Interspeech. 2019, 2019, 1008–1012.
3. Kamble, M.R.; Sailor, H.B.; Patil, H.A.; Li, H. Advances in antispoofing: From the perspective of ASVspoof challenges. APSIPA Trans. Signal Inf. Process. 2020, 9, 21.
4. Das, R.K.; Kinnunen, T.; Huang, W.-C.; Ling, Z.-H.; Yamagishi, J.; Yi, Z.; Tian, X.; Toda, T. Predictions of subjective ratings and spoofing assessments of voice conversion challenge 2020 submissions. arXiv 2020, arXiv:2009.03554.
5. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah; Sizov, A. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; Volume 2015, pp. 2037–2041.
6. Kinnunen, T.; Sahidullah; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. Proc. Interspeech. 2017, 2–6.
7. Zhang, C.; Yu, C.; Hansen, J.H. An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J. Sel. Topics Signal Process. 2017, 11, 684–694.
8. Gomez-Alanis, A.; Peinado, A.M.; Gonzalez, J.A.; Gomez, A.M. A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection. Proc. Interspeech. 2019, 2019, 1068–1072.
9. Wu, Z.; Das, R.K.; Yang, J.; Li, H. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. Proc. Interspeech. 2020, 2020, 1101–1105.
10. Khan, S.S.; Madden, M.G. A Survey of Recent Trends in One Class Classification. In Artificial Intelligence and Cognitive Science. AICS 2009. Lecture Notes in Computer Science; Coyle, L., Freyne, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 6206, pp. 188–197.
11. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA. 2017, 114, 3521–3526.
12. Yang, H.M.; Zhang, X.Y.; Yin, F.; Liu, C.L. Robust Classification with Convolutional Prototype Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3474–3482.
13. Zhang, Y.; Bai, G.; Li, X.; Curtis, C.; Chen, C.; Ko, R.K.L. PrivColl: Practical Privacy-Preserving Collaborative Machine Learning. In Lecture Notes in Computer Science, Proceedings of the Computer Security–ESORICS 2020, Guildford, UK, 14–18 September 2020; Chen, L., Li, N., Liang, K., Schneider, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12308, pp. 399–418.
14. Peng, X.; Huang, Z.; Zhu, Y.; Saenko, K. Federated adversarial domain adaptation. arXiv 2019, arXiv:1911.02054.
15. Bejiga, M.B.; Melgani, F. Gan-Based Domain Adaptation for Object Classification. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1264–1267.
16. Paris, C.; Bruzzone, L. A sensor-driven domain adaptation method for the classification of remote sensing images. In Proceedings of the 2014 IEEE Geoscience and Remote Sensing Symposium, Quebec City, QC, Canada, 13–18 July 2014; pp. 185–188.
17. Alegre, F.; Amehraye, A.; Evans, N. A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns. In Proceedings of the 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, USA, 29 September–2 October 2013.
18. Zhang, Y.; Jiang, F.; Duan, Z. One-Class Learning Towards Synthetic Voice Spoofing Detection. IEEE Signal Process. Lett. 2021, 28, 937–941.
19. Liu, C.-L.; Nakagawa, M. Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognit. 2001, 34, 601–615.
20. Sato, A.; Yamada, K. A formulation of learning vector quantization using a new misclassification measure. In Proceedings of the Fourteenth International Conference on Pattern Recognition, Brisbane, QLD, Australia, 20 August 1998; Volume 1, pp. 322–325.
21. Bonilla, E.; Robles-Kelly, A. Discriminative probabilistic prototype learning. arXiv 2012, arXiv:1206.4686.
22. Divvala, S.K.; Farhadi, A.; Guestrin, C. Learning Everything about Anything: Webly-Supervised Visual Concept Learning. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3270–3277.
23. Xiao, T.; Zhang, J.; Yang, K.; Peng, Y.; Zhang, Z. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM international conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 177–186.
24. Han, S.; Meng, Z.; Khan, A.S.; Tong, Y. Incremental boosting convolutional neural network for facial action unit recognition. Adv. Neural Inf. Process. Syst. 2017, 29, 109–117.
25. Wang, Z.; Kong, Z.; Changra, S.; Tao, H.; Khan, L. Robust High Dimensional Stream Classification with Novel Class Detection. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1418–1429.
26. Saito, K.; Watanabe, K.; Ushiku, Y.; Harada, T. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3723–3732.
27. Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A large-scale public database of synthetized, converted and replayed speech. Comput. Speech Lang. 2020, 64, 101114.
28. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
29. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
30. Kinnunen, T.; Lee, K.A.; Delgado, H.; Evans, N.; Todisco, M.; Sahidullah; Yamagishi, J.; Reynolds, D.A. t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. In Proceedings of the Speaker Odyssey 2018 The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018; pp. 312–319.

Figure 1. The framework of the system.

Figure 2. Schematic of domain adaptation. C_{1} and C_{2} denote classification boundaries. The dashed ellipse indicates the source domain, and the solid ellipse indicates the target domain. (a) indicates maximizing the shaded area. (b) indicates that the classifier is maximized. (c) indicates that the divergence is minimized. (d) denotes the final optimization objective with more robust decision boundary.

Figure 3. Schematic of OC-Softmax embedding vector distribution. The red and blue dots indicate the two classes of sample features. The dashed line indicates the classification decision boundary. The arrow indicates the optimization direction of the target class embedding vector.

Figure 4. Schematic for prototype training. Different colored circles indicate different classes of prototypes. The size of the circle indicates the size of the value.

Figure 5. Schematic of incremental learning. The circle in the figure represents the features. Triangles, rectangles, and different colors represent category labels, sample data, and different categories, respectively.

Figure 6. CM score histogram: (a) AM-Softmax. (b) Sigmoid. (c) Domain adaptation. (d) OC-Softmax.

Figure 7. (a) t-SNE visualization of the feature distribution of the learned categories A1–A9. (b) t-SNE visualization of the feature distribution of the learned categories A1–A9 together with the untrained categories A10–A17. The numbers in the figure indicate the category numbers of the forgery methods.

Figure 8. Classification confusion matrix after the last iteration of training.

Table 1. Results on the ASVspoof 2019 LA evaluation set.

Method	EER (%)	Min t-DCF
LFCC-GMM	8.090	0.21160
LFCC-LCNN—AM-Softmax	4.647	0.08278
LFCC-LCNN—Sigmoid	2.869	0.07248
LFCC-LCNN—Domain adaptation	2.638	0.08844
LFCC-LCNN—OC-Softmax	2.530	0.0682

Table 2. The accuracy during the incremental learning iterations.

Method	0 Iters Avg (%)	1 Iters Avg (%)	2 Iters Avg (%)	3 Iters Avg (%)	4 Iters Avg (%)
iCaRL	99.03	87.97	84.86	83.20	76.58
prototype	98.78	85.38	84.51	82.02	68.77
prototype + weight	98.78	87.17	85.31	88.35	79.89

Table 3. Recognition results after the last iteration of training.

Attack	Prototype + Weight Accuracy (%)
A1	87.30
A2	99.73
A3	99.89
A4	49.56
A5	92.65
A6	81.36
A7	62.63
A8	95.97
A9	98.09
A10	55.20
A11	89.60
A12	88.18
A13	92.71
A14	98.44
A15	92.14
A16	98.09
A17	99.93

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
