GANBA: Generative Adversarial Network for Biometric Anti-Spoofing

Abstract: Automatic speaker verification (ASV) is a voice biometric technology whose security might be compromised by spoofing attacks. To increase the robustness against spoofing attacks, presentation attack detection (PAD) or anti-spoofing systems for detecting replay, text-to-speech and voice conversion-based spoofing attacks are being developed. However, it was recently shown that adversarial spoofing attacks may seriously fool anti-spoofing systems. Moreover, the robustness of the whole biometric system (ASV + PAD) against this new type of attack is completely unexplored. In this work, a new generative adversarial network for biometric anti-spoofing (GANBA) is proposed. GANBA has a twofold basis: (1) it jointly employs the anti-spoofing and ASV losses to yield very damaging adversarial spoofing attacks, and (2) it trains the PAD as a discriminator in order to make it more robust against these types of adversarial attacks. The proposed system is able to generate adversarial spoofing attacks which can fool the complete voice biometric system. Then, the resulting PAD discriminators of the proposed GANBA can be used as a defense technique for detecting both original and adversarial spoofing attacks. The physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 database were employed to carry out the experiments. The experimental results show that the GANBA attacks are quite effective, outperforming other adversarial techniques when applied in white-box and black-box attack setups. In addition, the resulting PAD discriminators are more robust against both original and adversarial spoofing attacks.


Introduction
Biometric authentication [1] aims to authenticate the identity claimed by a given individual based on the samples measured from biological characteristics (e.g., voice, face, and fingerprints). In recent years, however, automatic speaker verification (ASV) technology has shown vulnerability to security attacks where impostors try to fraudulently access the system by inputting speech similar to the voice of a genuine user [2,3]. These security threats for voice biometric systems are known as spoofing attacks.
Four types of spoofing attacks were identified by the scientific community [4]: (i) replay (i.e., using a pre-recorded voice of the target user), (ii) impersonation (i.e., mimicking the voice of the target user), (iii) text-to-speech synthesis (TTS), and (iv) voice conversion (VC), where the last two use TTS or VC systems to generate artificial speech resembling the voice of a genuine user. Moreover, these attacks can be presented to the ASV system using either logical access (LA) or physical access (PA) scenarios. In the LA scenario, the sensor is bypassed and TTS- or VC-based attacks are directly injected into the ASV system. In the PA scenario, the replayed spoofing signal is presented to or captured by the sensor, i.e., the microphone.
Anti-spoofing or presentation attack detection (PAD in ISO/IEC 30107 nomenclature [5]) for voice biometrics has gained increased attention in recent years, as shown by the organization of multiple evaluation challenges such as ASVspoof 2015 [6].
More recently, different investigations [15][16][17] showed that anti-spoofing systems are also vulnerable to adversarial attacks [18]. This type of attack perturbs benign samples, normally in a way imperceptible to humans, so that they fool deep neural network (DNN)-based models [19]. There are two main categories of adversarial attacks: (i) white-box adversarial attacks, and (ii) black-box adversarial attacks. In this paper, those adversarial attacks where the attacker has all the information about the victim model (i.e., its artifacts, weights, and model architecture) will be referred to as white-box attacks. Similarly, the term black-box is used to indicate those attacks where the attacker has no knowledge about the victim model but can repeatedly query it to obtain a student model (surrogate) of the teacher (victim). To do this, the attacker uses the binary outputs provided by the victim model (acceptance/rejection) as ground-truth labels.
The main contributions of this work are summarized in the following:
• We study the robustness of the complete voice biometric system against adversarial spoofing attacks.
• We also propose a novel generative adversarial network for biometric anti-spoofing (GANBA) which generates adversarial spoofing attacks capable of fooling the PAD system without being detected by the ASV system, i.e., without changing the speaker information of the utterance. Moreover, while our previous work [15] was focused on adversarial attack generation, here we also train the PAD discriminator so that it provides us with a reinforced defense against adversarial and even original spoofing attacks.
• To the best of our knowledge, adversarial spoofing attacks have only been studied on TTS and VC spoofing attacks (LA scenarios). In this paper, replay spoofing attacks (PA scenarios) are also considered.
The rest of this paper is organized as follows. Section 2 outlines the PAD, ASV, and complete voice integration systems, as well as the metrics to evaluate them. In addition, adversarial spoofing attacks as well as some well-known adversarial attacks, employed as baselines in this work, are discussed. The GANBA framework for both white-box and black-box scenarios is proposed in Section 3. Then, Section 4 describes the system details, speech corpora, and the metrics evaluated in the experiments. Section 5 discusses the experimental results. Finally, Section 6 summarizes the conclusions derived from this research.

Background
This section briefly describes the existing standalone PAD and ASV approaches, as well as the complete voice integration systems, including the metrics to evaluate all of them. Moreover, Section 2.4 provides a detailed description of adversarial spoofing attacks.

Automatic Speaker Verification (ASV)
An ASV system determines whether or not an utterance is uttered by the claimed speaker. To do so, it typically obtains the speaker information of the utterance by extracting either i-vector [20] or x-vector features [21]. In the verification phase, the ASV system extracts the feature vectors of the enrollment and test utterances, and they are usually mapped into a more discriminative subspace using, for example, linear discriminant analysis (LDA). Then, the ASV score of the test utterance is typically obtained using one of the following techniques:
• Probabilistic Linear Discriminant Analysis (PLDA) [22,23]: it is a probabilistic framework which is able to model the inter- and intra-speaker variability. There are three types of PLDA models [24]: simplified [25], standard [22], and two-covariance [26]. In all variants, the expectation-maximization (EM) algorithm [27] is used to train the PLDA model.
• B-vector [28]: it is a DNN-based model which considers ASV as a binary classification problem. Specifically, from the x-vectors $\mathbf{x}_{\text{enroll}}$ and $\mathbf{x}_{\text{test}}$ computed for each pair of utterances (enrollment and test utterances), a b-vector representing the relationship between $\mathbf{x}_{\text{enroll}}$ and $\mathbf{x}_{\text{test}}$ is computed as follows,
$$\mathbf{b} = \left[\mathbf{x}_{\text{enroll}} \oplus \mathbf{x}_{\text{test}},\; \mathbf{x}_{\text{enroll}} \ominus \mathbf{x}_{\text{test}},\; \mathbf{x}_{\text{enroll}} \otimes \mathbf{x}_{\text{test}}\right], \tag{1}$$
where ⊕, ⊖ and ⊗ are the element-wise addition, subtraction and multiplication operations, respectively. Then, the b-vector features are fed to a binary DNN which determines whether the enrollment and test utterances are uttered by the same or different speakers.
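As an illustration of the b-vector construction described above, the following minimal sketch concatenates the element-wise sum, difference and product of two x-vectors. The function name `b_vector`, the plain (not absolute) difference, and the concatenation order are assumptions made for the example, not details confirmed by the text.

```python
def b_vector(x_enroll, x_test):
    """Build a b-vector from two x-vectors of equal dimension.

    Assumption: the three element-wise results are simply concatenated
    in the order (addition, subtraction, multiplication).
    """
    if len(x_enroll) != len(x_test):
        raise ValueError("x-vectors must have the same dimension")
    added = [e + t for e, t in zip(x_enroll, x_test)]
    subtracted = [e - t for e, t in zip(x_enroll, x_test)]
    multiplied = [e * t for e, t in zip(x_enroll, x_test)]
    return added + subtracted + multiplied


# Toy 2-dimensional "x-vectors" (real x-vectors have hundreds of dimensions).
example = b_vector([1.0, 2.0], [3.0, -1.0])
```

The resulting vector is three times as long as each x-vector and would be the input of the binary DNN classifier.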
An ASV system is typically evaluated on a test dataset which contains utterances uttered by either bonafide target speakers or zero-effort impostors [29]. The equal error rate (EER) is the most common metric to evaluate it, which is the operating point at which the false rejection rate (FRR) equals the false acceptance rate (FAR). However, the EER metric does not account for either the costs of falsely accepting impostors or missing target users, nor the prior probabilities of each. In order to take these costs and priors into account, the detection cost function (DCF) metric [30] has been proposed and evaluated in the most popular speaker recognition challenges [31].
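The EER definition above can be sketched as a simple threshold sweep. This is a coarse illustration that only tests thresholds at the observed scores and reports the point where FAR and FRR are closest. Evaluation toolkits usually interpolate between operating points, so the function name and the tie-breaking rule here are assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= threshold.
    Returns (eer, threshold) at the point where |FAR - FRR| is smallest.
    """
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, threshold
    return best_eer, best_threshold


eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

In this toy example one of four target trials and one of four non-target trials are misclassified at the selected threshold, so both error rates coincide.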

Anti-Spoofing or Presentation Attack Detection (PAD)
The goal of anti-spoofing is to differentiate between bonafide and spoofing speech. Two hypotheses are computed for each test utterance: (i) it is bonafide speech, or (ii) it is a spoofing attack.
In the last ASVspoof challenges [8][9][10], DNN-based models have been the most effective approach to differentiate between bonafide and spoofing speech. A wide range of features have been proposed for training these models, such as linear frequency cepstral coefficients (LFCC) [32], spectrograms [33], constant Q cepstral coefficients (CQCC) [34], and raw speech samples [35].
Anti-spoofing systems are typically evaluated using the EER_PAD metric, where a false acceptance happens when a spoofing utterance is detected as a bonafide utterance, while a false rejection occurs when a bonafide utterance is detected as a spoofing attack. Moreover, in order to take the costs and priors of the different hypotheses into account, the ASV-constrained minimum tandem detection cost function (min-tDCF) metric [36] has been recently proposed to evaluate anti-spoofing systems. This has been the primary metric of the ASVspoof 2019 and 2021 challenges [9,10].

Voice Integration Systems: Joint ASV and PAD
In the integration approach, each utterance has two attributes: (i) an indicator of the target speaker (S), and (ii) an indicator of bonafide speech (N). Therefore, the null hypothesis $H_{(S,N)}$ is that the test utterance is bonafide speech uttered by the target speaker. On the other hand, the complementary hypothesis is a union of the other three hypotheses:
$$H_{\overline{(S,N)}} = H_{(S,\bar{N})} \cup H_{(\bar{S},N)} \cup H_{(\bar{S},\bar{N})}, \tag{2}$$
where $(S,\bar{N})$ denotes a spoofing attack, $(\bar{S},N)$ represents bonafide speech uttered by a non-target speaker (i.e., a zero-effort impostor), and $(\bar{S},\bar{N})$ represents spoofing speech from a non-target speaker. The latter case, commonly referred to as a naive attack, does not make much sense in an authentication context and is usually discarded. Then, there are three types of utterances that PAD and ASV systems may encounter: (i) bonafide or genuine target, (ii) zero-effort impostor or genuine non-target, and (iii) spoofing target attacks.
The integration of PAD and ASV systems can be achieved at the score level [37] or at the feature level [11,38]. Most existing integration methods perform the integration at the score level, where dedicated classifiers are developed for PAD and ASV separately, and the scores computed by each separate system are combined. In this work, we focus on score-level integration. Specifically, we use the cascaded integration system depicted in Figure 1, which has been used in the last three ASVspoof challenges [8][9][10].
The integration systems are typically evaluated using the EER_joint, which can be measured, for example, on a test dataset that contains a combination of bonafide utterances, zero-effort attacks and spoofing attacks. However, the EER_joint does not account for the costs of falsely accepting spoofing attacks and zero-effort impostors or missing target genuine users, nor the prior probabilities of each. In order to take these costs and priors into account, the min-tDCF metric [36,39] was recently proposed for evaluating complete voice integration systems based on score-level decisions.

Adversarial Spoofing Attacks
Adversarial spoofing attacks can be generated by adding a minimally perceptible perturbation to the input spoofing utterance. The core idea of this type of adversarial attack is to refine the original spoofing attack so that it is more difficult to be detected by the PAD system. In other words, the goal of adversarial spoofing attacks is to fool the anti-spoofing or PAD system by maximizing the bonafide class likelihood with respect to that of the spoofing class.
Adversarial spoofing attacks can be generated by freezing the parameters θ of the DNN-based anti-spoofing model and performing a gradient descent algorithm which updates the input spectrum features X of the spoofing utterance so that the PAD misclassifies it as bonafide. Mathematically, it is an optimization problem which tries to find a sufficiently small perturbation δ which satisfies:
$$\tilde{X} = X + \delta, \qquad f_\theta(X) = y, \qquad f_\theta(\tilde{X}) = \tilde{y}, \tag{3}$$
where $\tilde{X}$ denotes the perturbed input spectrum features, $f$ denotes a well-trained DNN-based anti-spoofing model parameterized by θ, δ is the additive spectrum perturbation, $y$ is the ground-truth label corresponding to the original input spectrum features X, and $\tilde{y}$ denotes the label of the targeted class of the adversarial spoofing attack, i.e., the label of the bonafide class. Normally, the subspace of allowed perturbations is denoted by ∆, so that the perturbation δ ∈ ∆. In other words, the manipulative capability of the adversarial attack is formalized by the subspace ∆.
Two of the most popular adversarial attack techniques are: (i) the fast gradient sign method (FGSM) [40], and (ii) projected gradient descent (PGD) [41]. The FGSM attack generates the perturbation δ by taking a single step in the direction pointed by the gradient as follows,
$$\delta = \epsilon \cdot \mathrm{sign}\left(\nabla_{X} \mathrm{Loss}(\theta, X, y)\right), \tag{4}$$
where sign is an operation which takes the sign of its argument, and Loss is the loss function of the well-trained PAD neural network whose parameters are denoted by θ. Unlike FGSM, implemented as a single-step procedure, PGD is iterative. Thus, initializing with the original input spoofing spectrum features $X_0 = X$, the spectra of the spoofing attack are iteratively updated as follows,
$$X_{n+1} = \mathrm{clip}\left(X_n + \eta \cdot \mathrm{sign}\left(\nabla_{X_n} \mathrm{Loss}(\theta, X_n, y)\right)\right), \tag{5}$$
where $n = 0, \ldots, N-1$ denotes the iteration index (up to N iterations), $\eta > 0$ is the step size, and clip denotes a function which applies element-wise clipping such that $\lVert X_n - X \rVert_\infty \leq \epsilon$, with $\epsilon \geq 0$, $\epsilon \in \mathbb{R}$.
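To make the difference between the single-step FGSM and the iterative PGD concrete, the sketch below applies both to a toy logistic-regression "PAD" whose gradient is available in closed form. The toy model, the helper names, the step size `step`, and the convention of ascending the loss of the ground-truth label (`y = 0` for spoofing, `y = 1` for bonafide) are assumptions made for illustration. A real attack would back-propagate through the DNN-based PAD.

```python
import math


def bonafide_prob(w, b, x):
    """Toy PAD: probability that feature vector x is bonafide."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(w, b, x, y):
    """Gradient of the binary cross-entropy w.r.t. x (closed form for this toy model)."""
    p = bonafide_prob(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the loss gradient."""
    grad = loss_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def pgd(w, b, x, y, eps, step, n_iter):
    """Iterative variant: small signed steps, clipped to an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(n_iter):
        grad = loss_gradient(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, grad)]
        # Element-wise clipping keeps |x_adv - x| <= eps for every component.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv


w, b = [2.0, -1.0], -1.0
x_spoof = [0.0, 1.0]  # classified as spoofing by the toy PAD
x_fgsm = fgsm(w, b, x_spoof, y=0, eps=0.5)
x_pgd = pgd(w, b, x_spoof, y=0, eps=1.0, step=0.25, n_iter=10)
```

In this toy setting both attacks raise the bonafide probability of the spoofed input, and the PGD result stays inside the allowed perturbation ball thanks to the clipping step.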

Proposed Method
In this work, we propose a generative adversarial network for biometric anti-spoofing (GANBA) in order to generate adversarial spoofing attacks and, at the same time, train the PAD discriminator in order to make it more robust against this type of attack. The generator of the proposed GANBA is a neural network which is in charge of transforming the original input spoofing spectrum features into adversarial spoofing spectrum features against a target voice biometric system. Thus, the discriminator of the GANBA is a complete voice biometric system (ASV + PAD) which tries to differentiate between bonafide and spoofing speech (PAD system), and verify the identity of the enrolled speakers (ASV system).
The PAD and ASV models of the proposed GANBA provide either a probability distribution across the bonafide and spoofing class labels (white-box scenario) or just a binary decision indicating whether the test utterance is accepted or rejected by the biometric system (black-box scenario). In both scenarios, the goal of the proposed GANBA generator is to provide high-quality adversarial spoofing attacks from spoofing speech which are able to fool the anti-spoofing system while going undetected by the ASV subsystem (that is, the speaker information contained in the utterance is not modified). In contrast, the objective of the GANBA discriminator (complete voice biometric system) is to detect both the original and the adversarial spoofing attacks, where the original spoofing attacks are those unseen spoofing attacks of the test dataset which are not modified by any adversarial perturbation.
Figure 2 depicts the proposed GANBA architecture for the white-box scenario. The inputs to the GANBA generator are the short-time Fourier transform (STFT) features of a spoofing utterance, whose spectra it modifies in order to refine the spoofing attack. The output of the GANBA generator is fed to the PAD and ASV subsystems of the target biometric system. The ASV system only consists of a time-delay neural network (TDNN) [21] for x-vector extraction (the only component of the ASV system needed in the white-box scenario). This feature extractor is fed with the Mel-frequency cepstral coefficients (MFCCs) of the corresponding utterance, computed from the previously extracted log-power magnitude spectrum (STFT) features, as shown in the diagram of Figure 2. On the other hand, the DNN-based PAD system is also fed with the STFT features of the corresponding utterance and provides the softmax output vector of the utterance, whose first component indicates the probability of the utterance being bonafide.
The goal of the proposed framework is to train a GANBA generator capable of generating adversarial spoofing attacks which can fool the anti-spoofing system while not causing any changes to the ASV x-vector output, i.e., the adversarial attacks should not change the x-vector since it contains the speaker information of the utterance.
Figure 2. Proposed GANBA architecture for the white-box scenario. Step 1: generator-only training (ASV and PAD parameters frozen, with encircled outputs corresponding to Equations (7) and (8), respectively).
As shown in Figure 2, there are two different steps for training the whole architecture. Step 1 denotes the training of the GANBA generator, where the PAD and ASV parameters are not modified but gradients are computed and back-propagated to the GANBA generator. At this stage, the spoofing speech dataset is used only to train the GANBA generator. On the other hand, Step 2 denotes the training of the biometric system, i.e., the PAD and ASV components which make up the discriminator of the GANBA framework. While the TDNN of the ASV system is trained using only the bonafide speech dataset, the PAD system is trained using both the bonafide and spoofing speech datasets.
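Step 1 is in charge of optimizing the GANBA generator parameters in the white-box (w-box) scenario. In particular, the following loss function is minimized:
$$L_{\text{GANBA\_w-box}} = \beta \cdot L_{\text{ASV\_GANBA\_w-box}} + L_{\text{PAD\_GANBA\_w-box}}, \tag{6}$$
$$L_{\text{ASV\_GANBA\_w-box}} = \left\lVert \mathbf{x}^{\text{spoof}}_{\text{vector}} - \tilde{\mathbf{x}}^{\text{spoof}}_{\text{vector}} \right\rVert_2, \tag{7}$$
$$L_{\text{PAD\_GANBA\_w-box}} = \left\lVert r_\alpha\left(\mathbf{s}^{\text{spoof}}_{\text{PAD}}\right) - \tilde{\mathbf{s}}^{\text{spoof}}_{\text{PAD}} \right\rVert_2, \tag{8}$$
where $L_{\text{ASV\_GANBA\_w-box}}$ and $L_{\text{PAD\_GANBA\_w-box}}$ are the loss functions associated with the ASV and PAD systems, respectively, and β is a hyper-parameter which weights the relative importance of these two losses. $\mathbf{x}^{\text{spoof}}_{\text{vector}}$ and $\tilde{\mathbf{x}}^{\text{spoof}}_{\text{vector}}$ denote the x-vectors of the original and adversarial spoofing utterances, respectively. Likewise, the vectors $\mathbf{s}^{\text{spoof}}_{\text{PAD}}$ and $\tilde{\mathbf{s}}^{\text{spoof}}_{\text{PAD}}$ represent the output probability sets provided by the PAD system for the original and adversarial spoofing utterances, respectively. Moreover, $r_\alpha$ is a re-ranking function which is formulated as follows,
$$r_\alpha\left(\mathbf{s}^{\text{spoof}}_{\text{PAD}}\right) = \mathrm{norm}\left(\left\{\begin{array}{ll} \alpha \cdot \max\left(\mathbf{s}^{\text{spoof}}_{\text{PAD}}\right) & \text{if } k = 0 \\ s^{\text{spoof}}_{\text{PAD},k} & \text{otherwise} \end{array}\right\}_{k}\right), \tag{9}$$
where $k$ is the class index of the $\mathbf{s}^{\text{spoof}}_{\text{PAD}}$ probability vector, with $k = 0$ representing the class of bonafide speech, α > 1 is an additional hyper-parameter which defines how large $s^{\text{spoof}}_{\text{PAD},k=0}$ (i.e., the bonafide class probability) is with respect to the current maximum probability class, and norm denotes a normalizing function which provides a valid probability distribution.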
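The re-ranking function and the weighted combination of the two loss terms can be sketched as follows. This is only an illustration under stated assumptions: the normalization divides by the sum of the entries, both loss terms are taken to be Euclidean distances, and the helper names (`rerank`, `l2_distance`, `ganba_wbox_loss`) are hypothetical. The default values α = 10 and β = 0.001 are the ones reported later in the implementation details.

```python
import math


def rerank(s_pad, alpha, bonafide_index=0):
    """Re-ranking target: set the bonafide entry to alpha * max(s_pad), then renormalize."""
    peak = max(s_pad)
    boosted = [alpha * peak if k == bonafide_index else p for k, p in enumerate(s_pad)]
    total = sum(boosted)  # assumption: "norm" divides by the sum of the entries
    return [v / total for v in boosted]


def l2_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def ganba_wbox_loss(xvec, xvec_adv, s_pad, s_pad_adv, alpha=10.0, beta=0.001):
    """Generator loss: keep the x-vector unchanged and push the PAD output
    of the adversarial utterance toward the re-ranked target."""
    loss_asv = l2_distance(xvec, xvec_adv)  # assumed distance between x-vectors
    loss_pad = l2_distance(rerank(s_pad, alpha), s_pad_adv)  # assumed distance between PAD outputs
    return beta * loss_asv + loss_pad


s_spoof = [0.2, 0.8]  # PAD output for a spoofing utterance: [bonafide, spoof]
target = rerank(s_spoof, alpha=10.0)
```

With these toy numbers the target distribution places most of its mass on the bonafide class, which is the output the generator is trained to induce in the PAD system.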
On the other hand, Step 2 indicates the training of the voice biometric system (ASV + PAD) which acts as the GAN discriminator. The ASV system based on a TDNN is trained as a classifier using only the bonafide speech dataset. Likewise, the PAD system is also trained as a classifier using both the spoofing and bonafide speech datasets. Finally, Step 1 and Step 2 alternate to train the proposed GANBA as a generative adversarial network (GAN), where the GANBA generator is in charge of producing the adversarial spoofing attacks and is trained during Step 1, while the PAD and ASV discriminators are in charge of detecting the voice biometric attacks and are trained during Step 2 as normal ASV and PAD systems using their corresponding loss functions [2,21].
Figure 3 depicts the proposed GANBA architecture for the black-box scenario. Similar to the white-box case, the goal of the proposed system is the generation of adversarial spoofing attacks capable of fooling the target PAD system (teacher PAD) and bypassing the target ASV system (teacher ASV) by not modifying the x-vector representation which encodes the speaker information of the utterance. However, the main limitation of the black-box scenario is that the attacker does not have access to the target system (teacher) parameters.
Figure 3. Proposed GANBA architecture for the black-box scenario. Step 1: generator-only training (ASV and PAD parameters frozen, with encircled outputs corresponding to Equations (11) and (12), respectively).
In the proposed GANBA framework, the attacker makes requests to the black-box target (teacher) biometric system but only obtains a binary decision response of acceptance/rejection. The binary response is considered to be the ground-truth label for training the student PAD and b-vector [28] systems of the GANBA discriminator. It is worth noting that we assume that a rejection of the teacher system is made by the teacher PAD system, since we make sure that the x-vector representation of the utterance is minimally perturbed by the adversarial spoofing attack. Thus, the student b-vector and PAD systems are trained as binary classifiers able to imitate the behavior of the teacher ASV and PAD systems, respectively. In particular, the student b-vector system computes the probability that the two input x-vectors represent the same speaker, i.e., $P\left(b(\mathbf{x}_{\text{vector}}, \tilde{\mathbf{x}}_{\text{vector}}) = 1\right)$, with $b$ denoting the b-vector model.
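The querying procedure described above amounts to labeling the attacker's data with the teacher's accept/reject decisions. The sketch below illustrates this with a stand-in teacher. The function `label_with_teacher`, the dictionary layout of the utterances, and the toy decision rule are all hypothetical, since a real black-box teacher would only be reachable through its verification interface.

```python
def label_with_teacher(utterances, teacher_accepts):
    """Query the black-box teacher once per utterance and keep its binary
    decision (1 = accepted, 0 = rejected) as the training label."""
    return [(utt, 1 if teacher_accepts(utt) else 0) for utt in utterances]


def toy_teacher(utt):
    """Hypothetical stand-in for the teacher: accepts when a hidden score is high enough."""
    return utt["hidden_score"] > 0.5


queries = [
    {"id": "utt_a", "hidden_score": 0.9},
    {"id": "utt_b", "hidden_score": 0.2},
]
student_training_set = label_with_teacher(queries, toy_teacher)
labels = [label for _, label in student_training_set]
```

The resulting (utterance, label) pairs are what the student PAD and b-vector classifiers would be trained on.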
As shown in Figure 3, there are again two different steps for training the whole architecture. Step 1 denotes again the training of the GANBA generator, where the ASV and PAD network parameters are not modified but the gradients are back-propagated to the GANBA generator. Thus, Step 1 is in charge of optimizing the parameters of the GANBA generator in the black-box (b-box) case by minimizing the following loss function:
$$L_{\text{GANBA\_b-box}} = \beta \cdot L_{\text{ASV\_GANBA\_b-box}} + L_{\text{PAD\_GANBA\_b-box}}, \tag{10}$$
$$L_{\text{ASV\_GANBA\_b-box}} = 1 - P\left(b\left(\mathbf{x}^{\text{spoof}}_{\text{vector}}, \tilde{\mathbf{x}}^{\text{spoof}}_{\text{vector}}\right) = 1\right), \tag{11}$$
$$L_{\text{PAD\_GANBA\_b-box}} = \left\lVert \mathrm{onehot}(k = 0) - \tilde{\mathbf{s}}^{\text{spoof}}_{\text{PAD}} \right\rVert_2, \tag{12}$$
where $L_{\text{ASV\_GANBA\_b-box}}$ and $L_{\text{PAD\_GANBA\_b-box}}$ are the loss components associated with the ASV and PAD systems, respectively. Furthermore, onehot is the one-hot function [42] and $k = 0$ is the bonafide class index. Using this function, the input spoofing utterance is presented as bonafide and the PAD is fooled.
On the other hand, Step 2 denotes the training of the student biometric system (ASV b-vector + PAD) which acts as the discriminator of the GAN. The TDNN employed for x-vector extraction is pretrained and its parameters are also frozen in this step. However, the b-vector system is trained as a binary classifier [28] employing the test and enrollment utterances from the bonafide speech dataset. Likewise, the PAD system is also trained as a binary classifier using both bonafide and spoofing utterances. In both cases, the ground-truth labels are taken from the binary responses of the black-box target/teacher biometric system. Similar to the white-box scenario, Step 1 and Step 2 alternate in order to train the proposed GANBA as a typical GAN.
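A minimal sketch of the black-box generator loss follows, using the one-hot bonafide target mentioned above. The specific functional forms are assumptions: the ASV term is taken as one minus the same-speaker probability given by the student b-vector model, and the PAD term as a Euclidean distance to the one-hot target. The names `onehot` and `ganba_bbox_loss` are hypothetical.

```python
import math


def onehot(k, n_classes):
    """One-hot vector with a 1.0 at index k."""
    return [1.0 if i == k else 0.0 for i in range(n_classes)]


def ganba_bbox_loss(p_same_speaker, s_pad_adv, beta=0.001, bonafide_index=0):
    """Generator loss for the black-box case.

    p_same_speaker: student b-vector probability that the original and
                    adversarial x-vectors belong to the same speaker.
    s_pad_adv:      student PAD output for the adversarial utterance.
    """
    loss_asv = 1.0 - p_same_speaker  # assumed form of the ASV term
    target = onehot(bonafide_index, len(s_pad_adv))
    loss_pad = math.sqrt(sum((t - s) ** 2 for t, s in zip(target, s_pad_adv)))  # assumed L2 form
    return beta * loss_asv + loss_pad


fooled = ganba_bbox_loss(0.9, [0.9, 0.1])    # student PAD mostly believes "bonafide"
detected = ganba_bbox_loss(0.9, [0.1, 0.9])  # student PAD still says "spoof"
```

The loss is lower when the student PAD output of the adversarial utterance is closer to the bonafide one-hot target, which is the behavior the generator is optimized for.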

Experimental Setup
In this section, we describe the databases, spectral analysis, implementation details, and evaluation metrics employed in the experiments.

Speech Datasets
We used the ASVspoof 2019 corpus [43] to train and evaluate all the systems. This database is split into two subsets to allow PA and LA evaluation. Moreover, it includes protocols not only for evaluating PAD systems, but also for evaluating ASV and ASV + PAD integration systems. First, we employed this database for training the standalone anti-spoofing systems in the PA and LA scenarios, respectively. Then, we only used the spoofing utterances for generating adversarial spoofing attacks in order to bypass the complete voice biometric system. It is worth noting that the adversarial examples were not generated from bonafide utterances because we consider that perturbing them would make them lose their bonafide character.
To train the TDNN [21] of the ASV system as an x-vector feature extractor, we also employed the VoxCeleb1 database [44], which contains more than 1000 speakers. Moreover, in order to train the b-vector [28] ASV scoring system in the black-box scenario, the bonafide utterances (from the ASVspoof 2019 and VoxCeleb1 development datasets) were used, thus following the training protocol described in [11].

Spectral Analysis
The PAD systems were fed with log-power short-time Fourier transform (STFT) features with 256 frequency bins and 600 acoustic frames. In order to obtain the STFT features, a Hanning analysis window with a 10 ms frame shift and 25 ms of window length was employed. On top of these STFT features, 24 MFCCs (including the C0 cepstral coefficient), obtained with the Kaldi recipe [45], were extracted to feed the TDNN-based ASV system.
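As a quick sanity check of the framing parameters above, the following sketch relates the number of acoustic frames to the signal length for a 25 ms window and a 10 ms shift. The 16 kHz sampling rate and the absence of edge padding are assumptions for this example, as neither is stated in this section.

```python
def frame_params(sample_rate=16000, win_ms=25, shift_ms=10):
    """Window length and frame shift in samples (sample_rate is an assumed value)."""
    return sample_rate * win_ms // 1000, sample_rate * shift_ms // 1000


def num_frames(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Number of full analysis frames, assuming no padding at the edges."""
    win, shift = frame_params(sample_rate, win_ms, shift_ms)
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def samples_for_frames(n_frames, sample_rate=16000, win_ms=25, shift_ms=10):
    """Minimum number of samples needed to produce n_frames full frames."""
    win, shift = frame_params(sample_rate, win_ms, shift_ms)
    return win + (n_frames - 1) * shift


needed = samples_for_frames(600)  # samples covered by 600 frames
duration_s = needed / 16000       # corresponding duration in seconds
```

Under these assumptions, 600 frames span roughly six seconds of audio.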

Implementation Details
Three state-of-the-art anti-spoofing systems were adapted from other works: a light convolutional neural network (LCNN) [2], a residual neural network (ResNet34) [46] and a Squeeze-Excitation network (SENet50) [46]. The softmax layer output of the DNN-based models was directly used to obtain the PAD scores. For ASV, a TDNN model for x-vector feature extraction [21] was trained. Then, two ASV scoring systems were trained: (i) a standard PLDA [22]; and (ii) a b-vector system [28].
The generator of the proposed GANBA framework is a convolutional neural network (CNN) with five convolutional layers (16, 32, 48, 48, and 3 channels). Furthermore, it uses a kernel size of 3 × 3 as well as leaky ReLU activations. The Adam optimizer [47] (with learning rate 3 × 10^−4) was used to train the GANBA generator. Moreover, a grid search on the development dataset of the ASVspoof 2019 database was used to find the best empirical values of the hyper-parameters, namely α = 10 and β = 0.001. All the deep learning models were trained using the PyTorch toolkit [48].
The PGD method uses N = 30 training iterations for generating the adversarial spoofing attack, and for evaluation the number of iterations is set to N = 100 [49]. The magnitude of the perturbation is configured with the ε parameter. In the experiments, we do a grid search between ε = 0.1 and ε = 5.0 in order to find the optimal perturbation of the FGSM and PGD techniques. However, the magnitude of the GANBA perturbation is not restricted to any specific value. This is one of the main advantages of the proposed GANBA technique, since it finds the optimal perturbation value by itself in order to fool the PAD system without being detected by the ASV system, i.e., without changing the speaker information of the utterance.

Evaluation Metrics
A specific EER (EER_ASV) was used for ASV. We evaluated this metric either including only bonafide utterances or including both bonafide and spoofing utterances. Likewise, the PAD systems were also evaluated using the EER_PAD across all spoofing attacks. To compute the performance of the complete voice integration system, any utterance rejected by either the ASV or PAD systems was arbitrarily assigned a −∞ score. Finally, the integration systems were evaluated using the min-tDCF [39] metric and the joint EER (EER_joint) with the same configuration as that of the ASVspoof 2019 challenge [9]. The ASVspoof 2019 test datasets were used to evaluate all the ASV, PAD and complete voice integration systems.
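The −∞ score assignment for the integrated system can be sketched as below. The text only specifies that rejected utterances receive a −∞ score. Keeping the ASV score for accepted utterances, the thresholds, and the function name `joint_score` are assumptions made for this illustration.

```python
import math


def joint_score(asv_score, pad_score, asv_threshold, pad_threshold):
    """Score of the complete system for one trial.

    Any trial rejected by the PAD or by the ASV subsystem gets -inf.
    Accepted trials keep their ASV score (an assumption of this sketch).
    """
    if pad_score < pad_threshold or asv_score < asv_threshold:
        return -math.inf
    return asv_score


accepted = joint_score(asv_score=2.0, pad_score=0.9, asv_threshold=0.0, pad_threshold=0.5)
rejected_by_pad = joint_score(asv_score=2.0, pad_score=0.1, asv_threshold=0.0, pad_threshold=0.5)
rejected_by_asv = joint_score(asv_score=-1.0, pad_score=0.9, asv_threshold=0.0, pad_threshold=0.5)
```

Because −∞ is below any decision threshold, a trial rejected by either subsystem can never be accepted by the integrated system, whatever operating point is used to compute the joint metrics.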

Results
This section presents the experimental results from the evaluation of the described techniques on the ASVspoof 2019 corpus. First, Section 5.1 presents the results of different biometric systems without being exposed to any adversarial spoofing attacks. Then, Section 5.2 evaluates the vulnerability of a biometric system to white-box adversarial spoofing attacks. Likewise, Section 5.3 is devoted to the evaluation of the black-box adversarial spoofing attacks, where the details of the target biometric system remain unknown to the attacker. In Sections 5.2 and 5.3, the proposed white-box and black-box GANBA attacks, respectively, are compared to other classical adversarial spoofing attacks. Finally, Section 5.4 presents the results of the biometric system after applying two defense techniques: (i) adversarial training of the PAD discriminator using the generated adversarial spoofing attacks, and (ii) using the PAD discriminator trained within the proposed GANBA framework.
Table 1 presents the baseline results of six biometric systems which consist of the combination of three PAD (LCNN, SENet50 and ResNet34) and two ASV (TDNN + PLDA and TDNN + b-vector) systems. These biometric systems are not yet exposed to any adversarial spoofing attacks. These PA and LA anti-spoofing systems have been shown to provide some of the best single PAD performance in the ASVspoof 2019 challenge [9]. The best biometric system is the combination of LCNN and TDNN + PLDA as the PAD and ASV systems, respectively. Although the ASV system provides an EER of 6.87% and 4.71% over the PA and LA datasets, respectively, when evaluating exclusively with bonafide utterances (target and non-target), its performance meaningfully degrades when spoofing utterances are also evaluated (in particular, to 18.43% and 30.58% in the PA and LA evaluation datasets, respectively). This TDNN + PLDA / LCNN biometric system will be used as the teacher system for the black-box scenario in Section 5.3.
Figure 4 shows the EER joint of the best TDNN + PLDA/LCNN based biometric system evaluated in the previous section when being exposed to white-box adversarial attacks. The PGD technique as expected achieves slightly better results than the FGSM technique since PGD uses an iterative procedure for generating adversarial spoofing attacks. However, the proposed GANBA attacks outperform the other adversarial attacks, obtaining 20.94% and 27.63% higher absolute EER joint with respect to the best PGD configuration (i.e., = 1.0) in the PA and LA evaluation datasets, respectively. Another remarkable result is that using a hyper-parameter higher than 2.0 in PGD and FGSM, the perturbation is effectively detected by the biometric system. In such cases, the adversarial spoofing attacks may perform even worse than when only using the original spoofing attacks, i.e., when not generating adversarial attacks from spoofing speech (denoted by 'No Processing').  Figures 5 and 6 show the EER joint of the state-of-the-art TDNN + PLDA / LCNN biometric system for the black-box scenario when it is attacked with three different student biometric systems in the PA and LA scenarios, respectively. As shown in Figure 2, the ASV system of the attacker is a TDNN + b-vector system. On the other hand, each student biometric system uses a different PAD system: ResNet34, SENet50, and LCNN. Moreover, three types of adversarial spoofing attacks are employed by each student biometric system: FGSM ( = 1.0), PGD ( = 1.0) and the proposed GANBA attack. Table 1. Results of different complete voice biometric systems evaluated on the ASVspoof 2019 logical access (LA) and physical access (PA) test datasets in terms of EER PAD (%), EER ASV (%), EER joint (%) and min-tDCF. (*) The EER ASV metric is evaluated considering both bonafide and spoofing utterances. The proposed GANBA attack outperforms the best FGSM and PGD configurations ( = 1.0) by 27.67% and 27.08% in the LA scenario when using the LCNN PAD system. 
In the PA scenario, the proposed GANBA attack also outperforms them by 17.09% and 16.32% with the LCNN PAD system, respectively. It is worth noticing that the LCNN PAD system is always better than the ResNet34 and SENet50 since this system has the same architecture as the PAD system of the target (teacher) system. However, the SENet50 PAD system achieves only 3.44% and 3.09% of slower absolute EER joint when using the proposed GANBA attack compared to the LCNN PAD architecture, being able to effectively attack the teacher system achieving an EER joint of 50.12% and 40.12% in the LA and PA scenarios, respectively. This result shows how vulnerable a black-box complete voice biometric system can be to adversarial spoofing attacks. Figure 7 shows some examples of the spectrogram of bonafide speech, spoofing speech generated with a replay attack of the physical access ASVspoof 2019 database, adversarial spoofing speech refined by the PGD method, and that refined by the proposed GANBA technique. As can be seen, it is difficult to visually differentiate between the original spoofing spectrogram and the generated adversarial spectrogram with the proposed method. However, our technique is able to correct some of the spoofing artifacts so that the replay attack is introduced in the spectrogram so that the PAD system misclassifies the generated adversarial spoofing utterance as a bonafide utterance. In contrast, the PGD method is not able to compensate for those artifacts, and the PAD system still classifies it as spoofing speech.  Table 2 shows the performance metrics of the state-of-the-art TDNN + PLDA / LCNN biometric system when it applies two separate defense techniques: (i) adversarial training of the PAD discriminator using the generated adversarial spoofing attacks, and (ii) using the PAD discriminator trained within the proposed GANBA framework. 
Both defense techniques employ the black-box adversarial attacks generated with the SENet50 PAD system used as discriminator in Section 5.3. Thus, we can evaluate a realistic scenario where the architecture of the target PAD system (LCNN) differs from that of the attacker PAD system (SENet50).
Table 2 is divided into three sections separated by horizontal lines. For the sake of comparison with the attack-defense combinations, the first row shows the performance metrics of the biometric system when it is not exposed to any adversarial spoofing attack. The next nine rows show the performance metrics of the biometric system for each attack-defense combination: (i) the attack technique is one of the three types of adversarial spoofing attacks (FGSM, PGD or GANBA) evaluated in this work; and (ii) the defense technique consists of adversarial training with the adversarial spoofing attacks of one of the three black-box methods (FGSM, PGD or GANBA) evaluated in Section 5.3 and generated with the system which employs the SENet50 PAD discriminator. Finally, the last three rows of Table 2 show the performance metrics of the biometric system when it is attacked by one type of adversarial spoofing attack (FGSM, PGD or GANBA) and, at the same time, defended using the PAD discriminator trained with the proposed GANBA framework described in Section 3.2.
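The FGSM and PGD baselines that appear in these attack-defense combinations differ mainly in how they spend the perturbation budget ε: FGSM takes a single signed-gradient step of size ε, whereas PGD takes several smaller steps and projects the result back into the ε-ball after each one. The following is a minimal sketch of that procedural difference only. The toy loss, its gradient, the step size and the number of iterations are hypothetical stand-ins chosen for illustration; they are not the PAD loss or the settings used in this work, and the PGD variant shown omits the random initialization that some implementations use.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, grad_fn, eps):
    """Single-step attack: move every sample by eps along the gradient sign."""
    g = grad_fn(x)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(x, grad_fn, eps, step, n_iter):
    """Iterative attack: small signed steps, each projected back into the
    L-infinity ball of radius eps around the original input x."""
    x_adv = list(x)
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip each sample to [x0 - eps, x0 + eps].
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv


# Hypothetical toy loss (NOT the PAD loss of the paper): it is maximal at `peak`.
peak = [0.3, -0.3]


def toy_loss(x):
    return -sum((xi - pi) ** 2 for xi, pi in zip(x, peak))


def toy_grad(x):
    # Analytic gradient of toy_loss with respect to x.
    return [-2.0 * (xi - pi) for xi, pi in zip(x, peak)]


x0 = [0.0, 0.0]  # stands in for the unperturbed spoofing input
x_fgsm = fgsm(x0, toy_grad, eps=1.0)
x_pgd = pgd(x0, toy_grad, eps=1.0, step=0.1, n_iter=10)
print(toy_loss(x0), toy_loss(x_fgsm), toy_loss(x_pgd))
```

In this toy setting the single large FGSM step overshoots the maximum and ends with a lower loss than the unperturbed input, while the iterative PGD updates stay close to it. This only illustrates the mechanics of the two baselines and says nothing about the systems evaluated here.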

Defenses against Adversarial Spoofing Attacks
As shown in Table 2, adversarial training with the generated GANBA attacks significantly outperforms adversarial training with either FGSM or PGD attacks, by more than 2% of absolute EER joint in both the PA and LA scenarios. In contrast, the proposed GANBA adversarial spoofing attacks are not effectively detected when adversarial training uses either the FGSM or PGD attacks, since the EER joint is then more than 34% in both the PA and LA scenarios. This result highlights the effectiveness of the proposed GANBA attacks. It is also noticeable that adversarial training with the generated GANBA attacks slightly improves the results of the baseline complete biometric system which is not exposed to any adversarial spoofing attacks. This may be because the generated GANBA attacks act as data augmentation, which helps to detect even more of the original spoofing attacks of the ASVspoof 2019 evaluation dataset.
Nevertheless, the best defense technique for the target biometric system is to use the PAD discriminator trained with the proposed GANBA framework, which outperforms the adversarial training defense technique in all cases. It even significantly improves the results of the baseline biometric system which is not exposed to any adversarial spoofing attacks, by 1.96% and 2.13% of absolute EER joint in the PA and LA scenarios, respectively. This means that the PAD discriminator trained with the proposed GANBA framework is helpful not only for detecting adversarial spoofing attacks but also for detecting the original spoofing attacks of the ASVspoof 2019 test dataset.
Table 2. Results of the TDNN + PLDA/LCNN biometric system evaluated on the ASVspoof 2019 logical access (LA) and physical access (PA) test datasets in terms of EER PAD (%), EER ASV (%), EER joint (%) and min-tDCF. The FGSM and PGD attacks employ their best attack configuration (ε = 1.0). (*) The EER ASV metric is evaluated considering both bonafide and spoofing utterances.
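Since the comparisons above are stated as absolute differences in equal error rate, it may help to recall how an EER is obtained from detection scores. The sketch below is a generic, threshold-sweep EER for a single detector in which higher scores mean "accept"; the function name and the example scores are hypothetical. It does not reproduce how this work combines the ASV and PAD decisions into the EER joint, nor the min-tDCF computation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the candidate threshold where the false
    acceptance and false rejection rates are closest to each other."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical scores, e.g. bonafide (target) versus spoofing (non-target) trials.
bonafide = [0.9, 0.8, 0.7, 0.6, 0.3]
spoof = [0.1, 0.2, 0.4, 0.5, 0.65]
eer, thr = equal_error_rate(bonafide, spoof)
print(f"EER = {100 * eer:.1f}% at threshold {thr}")
```

With these made-up scores, one bonafide trial falls below the selected threshold and one spoofing trial lies above it, so both error rates equal 20%. An absolute improvement of 2% in this metric would then mean moving from 20% to 18%, not a 2% relative change.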

Conclusions
In this paper, the robustness of state-of-the-art complete (ASV + PAD) voice biometric systems against adversarial spoofing attacks was studied. We proposed a novel generative adversarial network for biometric anti-spoofing (GANBA) capable of fooling the anti-spoofing system without being detected by the ASV system, i.e., without changing the speaker information of the utterance. Furthermore, we employed the generated attacks for defending the system by either applying adversarial training or using the resulting PAD discriminator of the proposed GANBA framework.
Experimental results showed that voice biometric systems are highly vulnerable to adversarial spoofing attacks in both physical and logical access scenarios. However, we showed that the biometric system can be effectively defended using the PAD discriminator of the proposed GANBA system. In fact, the proposed defense technique proved to be more robust against both adversarial and original spoofing attacks. It is worth noting that the results presented here with the GANBA attacks can be directly compared with those of the adversarial biometric transformation network (ABTN) attacks in [15]. This comparison shows that, although ABTN and GANBA follow the same strategy for attack generation, GANBA goes a step further by being trained as a generative adversarial network (GAN), and shows a higher capability for fooling the voice biometric system.
In the future, we would like to explore a cross-database evaluation of the proposed defense technique for voice biometric systems against adversarial spoofing attacks in order to study its generalization between different datasets [50].
Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki. Ethical review and approval were waived for this study, due to the usage of public datasets.
Informed Consent Statement: Informed consent was obtained from all subjects involved in this study when acquiring the public datasets employed in this work.