Integrated Replay Spoofing-aware Text-independent Speaker Verification

A number of studies have successfully developed speaker verification or spoofing detection systems. However, studies integrating the two tasks remain in the preliminary stages. In this paper, we propose two approaches for the integrated replay spoofing-aware speaker verification task: an end-to-end monolithic approach and a back-end modular approach. The first approach simultaneously trains speaker identification, replay spoofing detection, and the integrated system using multi-task learning with a common feature. However, through experiments, we hypothesize that the information required for performing speaker verification and replay spoofing detection might differ, because speaker verification systems try to remove device-specific information from speaker embeddings while replay spoofing detection exploits such information. Therefore, we propose a back-end approach using a deep neural network that takes speaker embeddings extracted from enrollment and test utterances and a replay detection prediction on the test utterance as input. Experiments are conducted using the ASVspoof 2017-v2 dataset, which includes official trials on the integration of speaker verification and replay spoofing detection. The proposed back-end approach demonstrates a relative improvement of 21.77% in terms of the equal error rate for integrated trials compared to a conventional speaker verification system.

Table 1 demonstrates the vulnerability of conventional SV systems when faced with replay attacks. The performance is reported using the three types of equal error rates (EERs) described in Table 2 [12]. Table 2 lists the trials used to calculate each EER, with target and non-target trials represented by 1 and 0, respectively. Zero-effort (ZE)-EER describes the conventional SV performance without the presence of replay attacks. PAD-EER denotes the errors of replay attack detection. Integrated (Int)-EER describes overall performance, including both ZE and replayed non-target trials. Hereafter, we refer to replay spoofing-aware SV as an integrated speaker verification (ISV) task and report its performance using Int-EER. Results show that the EER degrades to 33.72% with replayed utterances; this severe performance degradation supports the necessity of a spoofing-aware ISV system. In this paper, PAD covers replay attacks only, because the official integrated trials of PAD and ASV are provided for ASVspoof2017, which covers replay attacks only. While a number of studies have developed independent systems for SV and PAD, few have sought to integrate the two [12][13][14][15][16][17]. More specifically, this handful of studies proposed approaches such as cascaded, parallel [12,13], and joint systems [14,16,17]. Most existing studies use common features to integrate the two tasks for system efficiency. Section 2 discusses this existing body of work further.
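The three EERs differ only in which non-target trials are scored against the target trials. The following is a minimal sketch of that bookkeeping, not the official scoring tool: the function names, the trial-kind strings, and the toy scores are all hypothetical, and the EER is approximated by the threshold at which false acceptance and false rejection rates are closest.

```python
def compute_eer(scores, labels):
    """Approximate EER; higher scores mean 'more target-like', labels are 1/0."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for thr in sorted(set(scores)) + [float("inf")]:
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr) / n_nontarget
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr) / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def eer_for(trials, nontarget_kinds):
    """Score target trials against the selected kinds of non-target trials."""
    selected = [(s, k) for s, k in trials if k == "target" or k in nontarget_kinds]
    scores = [s for s, _ in selected]
    labels = [1 if k == "target" else 0 for _, k in selected]
    return compute_eer(scores, labels)


# Toy scores (hypothetical) mimicking a spoofing-unaware SV system:
# replayed trials carry the target speaker's voice, so they score high.
trials = (
    [(s, "target") for s in (0.9, 0.8, 0.7, 0.6)]
    + [(s, "ze") for s in (0.1, 0.2, 0.3, 0.4)]
    + [(s, "replay") for s in (0.85, 0.75, 0.2, 0.65)]
)

ze_eer = eer_for(trials, {"ze"})             # target vs. ZE non-target
pad_eer = eer_for(trials, {"replay"})        # target vs. replay non-target
int_eer = eer_for(trials, {"ze", "replay"})  # target vs. all non-target
```

On these toy scores the ZE trials are perfectly separable while the replayed trials are not, which is the qualitative pattern Table 1 reports for a conventional SV system.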
In this study, we propose two spoofing-aware frameworks for the ISV task, illustrated in Figure 1. The first proposed framework expands existing work by proposing a monolithic end-to-end (E2E) architecture. More specifically, it conducts speaker identification (SID) and PAD to train a common feature using multi-task learning (MTL) [18]. Concurrently, it uses the embeddings to compose trials and conduct the ISV task. The entire DNN is jointly optimized using the sum of the SID, PAD, and ISV losses. However, based on tendencies observed during internal experiments, we hypothesize that training a common feature for the ISV task may not be ideal because the properties required for each task differ: the PAD task representation uses device and channel information, while SV needs to remove it (this is further discussed in Section 3).
Based on our hypothesis, we propose a novel modular approach using a separate DNN. This approach takes two speaker embeddings (one for enrollment and one for test) and a PAD prediction as input to make the ISV decision. It adopts a two-phase approach. In the first phase, the speaker identifier and PAD system are trained separately. In the second phase, speaker embeddings are extracted from a pretrained speaker identifier [19], and the embeddings and PAD prediction results are fed to a separate DNN module. Using this framework, we achieved a 21.77% relative improvement in terms of Int-EER. The contributions of this paper are:
1. Propose a novel E2E framework which jointly optimizes the SID, PAD, and ISV tasks
2. Experimentally validate the hypothesis that the discriminative information required for the SV and the PAD task may be distinct, requiring separate front-end modeling
3. Propose a separate modular back-end DNN which takes speaker embeddings and PAD predictions as input to make the ISV decision
The remainder of the paper is organized as follows. Section 2 details related work on the integrated system of ASV and PAD. Section 3 introduces the two proposed frameworks. Section 4 presents our experiments and results, and the paper is concluded in Section 5.

Related work
In this section, we introduce the two studies most relevant to ours [12,16,17]. Firstly, Todisco et al. [12] proposed separate modelling of two Gaussian back-end systems with a unified threshold for both SV and PAD tasks. This study explored various acoustic features to find which ones best suited both tasks simultaneously. As organizers of the ASVspoof challenges, the authors also released official trials for the ISV task in this study. For our purposes, it is important to highlight that these trials include both ZE and replayed non-target trials, and we use them throughout this paper. However, Todisco et al. [12] reported the average of two EERs, ZE-EER and PAD-EER, because they separately modeled two Gaussian mixture models, one for each task. Meanwhile, Li et al. [16,17] extended the work of Todisco et al. [12] by proposing an integrated ISV system, which was the first study to report an Int-EER. More specifically, they proposed a three-phase training framework for extracting an embedding for the ISV task, followed by a probabilistic linear discriminant analysis (PLDA) back-end. In the first phase, MTL [18] was employed to train a common embedding for both SV and PAD tasks. In the second and third phases, the embedding was adapted to fit the ISV task. However, because the DNN was adapted in the third phase to fit the enrollment speakers, the approach is of limited use in real-world scenarios. In addition, because the performance was reported using self-configured trials, it is difficult to compare the EER.
In this study, we first propose an E2E framework, illustrated in Figure 1-(a), that extends the work of Li et al. [16,17] in two aspects. First, we adopt a single-phase training approach by using three loss functions for SID, PAD, and ISV. Second, our framework directly outputs a spoofing-aware score without using a separate back-end system.

Integrated speaker verification
In this section, we describe the two proposed frameworks for conducting speaker verification that is aware of replay spoofing attacks, as shown in Figure 1.

End-to-end monolithic approach
We first propose an E2E monolithic approach. This architecture simultaneously trains all components, including SID, PAD, and ISV, using a common feature, as illustrated in Figure 1-(a). The loss function for training the proposed E2E architecture comprises three components: a categorical cross-entropy (CCE) loss for SID, a binary cross-entropy (BCE) loss for PAD, and a two-class BCE loss for ISV. When a mini-batch is input for training, the proposed system first conducts SID and PAD with an MTL framework. Then, it composes a number of trials. A trial consists of two embeddings, one for enrollment and the other for test. The ISV prediction is made by feed-forwarding the two embeddings through a few fully-connected layers. The entire DNN is jointly optimized using the sum of the three loss functions. The objective function $\mathrm{Loss}$ is defined as follows:
$$\mathrm{Loss} = \mathrm{Loss}_{SID} + \mathrm{Loss}_{PAD} + \mathrm{Loss}_{ISV}$$
where $\mathrm{Loss}_{SID}$ refers to the CCE loss for SID, $\mathrm{Loss}_{PAD}$ is the BCE loss for PAD, and $\mathrm{Loss}_{ISV}$ denotes the CCE loss of ISV. However, our experiments consistently indicated that it is difficult to extract a common representation, i.e., feature, for performing both SV and PAD tasks. Therefore, we hypothesize that, although the SV and PAD tasks are closely related in this scenario, the discriminative information required for each task conflicts. Speaker embeddings for the SV task require robustness to device and channel differences; meanwhile, the representation for the PAD task uses such information [20]. Also, both bona-fide and replayed utterances include the same speaker information, making it a less discriminative factor for the PAD task; meanwhile, it is key information for the SV task. The study of Sahidullah et al. [13], whose analysis concludes that the SV and PAD tasks should exist independently, supports our hypothesis. To validate our hypothesis, we conduct experiments using separately trained SV and PAD systems and MTL-based systems. We further detail these elements in Section 4.3.
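The joint objective can be sketched in a few lines. This is only an illustration of how the three terms add up, assuming the network outputs are already probabilities; the helper names are hypothetical, and indexing the two-class ISV output by the trial label (1 = target, 0 = non-target) is an assumption, since the text does not fix the node order for the E2E system. The trial label follows Table 2, where the enrollment utterance is always bona-fide.

```python
import math


def cce(probs, target):
    """Categorical cross-entropy for one sample given class probabilities."""
    return -math.log(probs[target])


def bce(prob, label):
    """Binary cross-entropy for one sample; prob is the probability of class 1."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def trial_label(enroll_speaker, test_speaker, test_is_bonafide):
    """1 for a target trial, 0 for a ZE or replay non-target trial."""
    return int(enroll_speaker == test_speaker and test_is_bonafide)


def e2e_loss(sid_probs, sid_labels, pad_probs, pad_labels, isv_probs, isv_labels):
    """Loss = Loss_SID + Loss_PAD + Loss_ISV, each averaged over its samples."""
    loss_sid = mean(cce(p, y) for p, y in zip(sid_probs, sid_labels))
    loss_pad = mean(bce(p, y) for p, y in zip(pad_probs, pad_labels))
    # Assumption: isv_probs[i][c] is the probability of trial label c.
    loss_isv = mean(cce(p, y) for p, y in zip(isv_probs, isv_labels))
    return loss_sid + loss_pad + loss_isv
```

In a real system the three terms would be computed by a deep learning framework on the mini-batch and on the trials composed from it, so that gradients from all three tasks reach the shared layers.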

Back-end modular approach
We also propose a novel modular approach using a separate DNN that takes speaker embeddings and PAD predictions as input to make an ISV decision. Figure 1-(b) illustrates our second proposed system. We use the LCNN architecture [21] to extract both speaker embeddings and spoofing predictions; this choice is based on its success in various spoofing detection studies [11,22].
Based on the hypothesis addressed in the previous subsection, we design an integrated system using a two-phase approach. In the first phase, we separately train an SID system to extract speaker embeddings from the last hidden layer and a PAD system to extract a spoofing prediction. Then, we train the ISV system using as input a pair of speaker embeddings (one for enrollment and the other for test) extracted from the SID system, together with a PAD label. This system has an output layer with two nodes: the first node indicates "acceptance" and the second node indicates "rejection" for both ZE and replay trials.
In Figure 1-(b), the part trained in phase 2 is the proposed back-end ISV system. It takes the two speaker embeddings and their multiplication as input, and a module of four fully-connected layers outputs a scalar that indicates whether they are uttered by the same speaker. The fully-connected layers comprise 256 nodes each, and the output layer comprises one node with a sigmoid function.
Next, the SV and PAD prediction results and their multiplication are fed to a fully-connected layer to make the final decision. In an ideal scenario, the multiplication of the SV result and PAD prediction would be 1 when both SV and PAD are positive and 0 otherwise; we assume this multiplication additionally informs the final decision. The objective function $\mathrm{Loss}_{int}$ for the back-end modular approach comprises the loss for the SV task and the loss for the final decision, defined as:
$$\mathrm{Loss}_{int} = \alpha \, \mathrm{Loss}_{SV} + \mathrm{Loss}_{ISV}$$
where $\mathrm{Loss}_{SV}$ and $\mathrm{Loss}_{ISV}$ refer to the BCE loss of the SV task and the CCE loss of the ISV task, respectively, and $\alpha$ signifies the weight for the SV loss. We note that training the proposed back-end DNN with only $\mathrm{Loss}_{ISV}$ results in overfitting.
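The final fusion step and its objective can be illustrated as follows. This is a sketch under stated assumptions rather than the trained system: the function names are hypothetical, the weights `W` and `b` are set by hand purely for illustration, and the SV sub-network is replaced by a given scalar score. The node order (index 0 = acceptance, index 1 = rejection) follows the description of the output layer above.

```python
import math


def fusion_input(sv_score, pad_score):
    """Features for the final layer: SV result, PAD prediction, and their product."""
    return [sv_score, pad_score, sv_score * pad_score]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def final_decision(features, weights, biases):
    """One fully-connected layer with two outputs: index 0 = accept, 1 = reject."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def loss_int(sv_score, sv_label, isv_probs, isv_label, alpha):
    """Loss_int = alpha * Loss_SV + Loss_ISV for a single trial."""
    loss_sv = -(sv_label * math.log(sv_score)
                + (1 - sv_label) * math.log(1 - sv_score))
    loss_isv = -math.log(isv_probs[isv_label])
    return alpha * loss_sv + loss_isv


# Hand-picked (hypothetical) weights that rely only on the product feature.
W = [[0.0, 0.0, 4.0], [0.0, 0.0, -4.0]]
b = [-2.0, 2.0]
genuine = final_decision(fusion_input(0.9, 0.95), W, b)   # same speaker, bona-fide
replayed = final_decision(fusion_input(0.9, 0.05), W, b)  # same speaker, replayed
```

With these illustrative weights, a high SV score is accepted only when the PAD prediction is also high, which is the behaviour the multiplication term is meant to encourage.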
Based on a number of experiments that we omit for the sake of brevity, we found two key components that made our proposed back-end DNN framework successful. First, we model ZE and replay trials into separate score distributions. Figures 2-(a) and (b) respectively illustrate the score distributions of the evaluation trials of the SV baseline and the proposed modular back-end DNN. In Figure 2-(a), the score refers to the cosine similarity of the two embeddings. Here, the score distribution of replay non-target trials severely overlaps with that of target trials. In our analysis, this results from embeddings that consider only speaker information, in which replayed and bona-fide utterances coincide. In various experiments, it was impossible to model both replay and ZE non-target trials into the same score distribution. When one kind of non-target trial was successfully modeled, the other resulted in a distribution similar to uniform. Therefore, we aim to separate the two non-target score distributions, specifically by modelling the score distribution of ZE non-target trials to have a mean of 0.5 and that of replay non-target trials to have zero mean. To do so, we sequentially apply rectified linear unit (ReLU) and sigmoid activation functions to the output of SV, before the last hidden layer for ISV. Figure 2-(b) demonstrates the score distribution of the proposed method. The results demonstrate that the three types of trials are modeled as intended (i.e., the model generalizes well) on the evaluation set, although these trials comprise unknown speakers and replay conditions.
Second, we use actual PAD labels instead of PAD predictions of the spoofing DNN in the training phase. This choice is based on empirical comparisons in which the use of PAD predictions in the training phase worsened performance. In our analysis, using PAD labels in the training phase was more helpful because even a small number of misclassified utterances among PAD predictions can disrupt the training of the proposed DNN. Notably, we empirically observed model collapse when training the proposed modular DNN using PAD predictions.
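A short numerical check shows why stacking ReLU and sigmoid is compatible with a ZE non-target mean of 0.5. The sketch below is one reading of the design, not a description of the exact layer placement: `x` is a hypothetical pre-activation value, and combining the result with a PAD value near 0 by multiplication is an assumption made only to illustrate how replayed trials could be pushed toward zero.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squash(x):
    """Sigmoid applied after ReLU: exactly 0.5 for x <= 0, in (0.5, 1) for x > 0."""
    return sigmoid(relu(x))


non_match = squash(-3.0)  # different speakers: floor value of 0.5
match = squash(2.0)       # same speaker: above 0.5

# Assumption: a spoofing-aware score is obtained by scaling with a PAD value,
# so a replayed test utterance (PAD value near 0) lands near zero.
replayed = match * 0.02
```

Because ReLU clips every non-positive input to zero and the sigmoid of zero is 0.5, all non-matching bona-fide trials collapse onto a single value, leaving the region near zero free for replayed trials.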

Dataset
All experiments in this study were conducted using the ASVspoof2017-v2 dataset [23]. To evaluate the proposed integrated systems, we used the trials reported in [12]. We used the training and development sets, comprising 2267 bona-fide and 2457 replay spoofed utterances from 18 speakers, to train all systems. To evaluate speaker verification and spoofing detection performance, we measured the ZE-EER and the PAD-EER using the ASVspoof2017 joint PAD+ASV evaluation trials. These comprised 1106 target, 18624 ZE, and 10878 replayed trials. We use the target and ZE trials for the ZE-EER and the target and replayed trials for the PAD-EER evaluation.
Regarding our use of ASVspoof2017-v2, we found that relatively thin LCNN structures were helpful for performance improvement; this may have been a result of the small size of the dataset. We also found that minute changes in the DNN greatly influenced the performance because of the small data scale, which further favored a relatively thin structure. To derive a value between 0 and 1 for the PAD task, we used a network architecture identical to that of [11] but replaced the angular margin softmax activation [25] with a sigmoid function. We also modified the architecture for the SV task based on [11]. Speaker embeddings had a dimensionality of 1024.

Results analysis
Table 3 describes the results of the proposed E2E framework with a monolithic approach. System #1 refers to the proposed architecture that jointly optimizes the SID, PAD, and ISV losses (Figure 1-(a)). System #2-SE is the result of applying squeeze-excitation (SE) [26], based on its recent application to PAD [9]. System #3 describes the result of assigning three max feature map (MFM) blocks [21] for SID as well as for PAD after the first three MFM blocks. Because most of the systems' performance measures deteriorated compared to the SV baseline, we concluded that the monolithic E2E approach was not ideal for the ISV task. While the results of the experiments were different from what we expected, they nevertheless serve as a springboard for establishing a new hypothesis.
Table 4 addresses the validation of the hypothesis from Section 3, formed from the results in Table 3, that the discriminative information for the SV and PAD tasks is distinct. To validate our hypothesis, we trained our SV and PAD baselines with and without the additional loss for extracting common embeddings. Here, the first and third rows refer to the SV and PAD baselines, and the second and fourth rows refer to the usage of the additional loss.
Table 5 summarizes the results of various attempts to improve the performance of the proposed back-end modular approach. The comparison of Systems #4 and #5 shows the effectiveness of using the multiplication of the SV result and PAD prediction for the ISV task. System #6 refers to the result of weighting the SV task in the training phase, where we set α to 20. System #7 shows the result of reducing the number of nodes per hidden layer.
Finally, Table 6 compares our proposed modular approach with the SV baseline and existing work [12] using official trials. The results demonstrate that the proposed approach stabilizes the unbalanced performance between ZE-EER and PAD-EER. Compared with the SV baseline, which does not consider replay attacks, we achieved a relative improvement of 21.77%. It is important to note that we were unable to compare the Int-EER with that of Todisco et al. [12], although it is the only study that reported performance using official trials. Because that study proposed a unified threshold for conducting the SV and PAD tasks, Int-EER results using the full set of trials do not exist.

Conclusion
In this paper, we investigated the integration of speaker verification and replay spoofing detection. We proposed two methods for their integration: an E2E monolithic approach and a back-end modular approach. The proposed E2E approach simultaneously trains SID, PAD, and ISV using a common feature. The experimental results of the E2E approach led us to hypothesize that the discriminative information for SID and PAD differs. Based on our hypothesis, we proposed a framework using a separate back-end DNN that takes speaker embeddings and a PAD prediction extracted from pretrained SV and PAD systems as input. The effectiveness of our proposed systems was verified using official trials for the ISV task, where we achieved an EER of 15.63%. We expect the proposed method to perform better still when improved speaker embeddings and PAD predictions are used as input.

Figure 1: (a) An end-to-end architecture that trains embeddings (used for speaker identification and spoofing detection) and ISV concurrently. (b) A separate architecture that inputs speaker embeddings and spoofing detection results and outputs the ISV result.

Figure 2: Histograms of score distribution on the evaluation trials. (a) SV baseline, where the score is calculated using the cosine similarity of two speaker embeddings. (b) The proposed modular system, where three types of trials have three different distributions.

Table 1: Difference in EER according to the existence of replay non-target trials. Results demonstrate the vulnerability of speaker verification systems unaware of replay spoofing attacks.

Table 2: Three types of EERs reported in this paper. The enrollment utterance is always bona-fide. Target: enroll and test utterances are uttered by an identical speaker and are bona-fide. ZE non-target: enroll and test utterances are uttered by different speakers and are bona-fide. Replay non-target: enroll and test utterances are uttered by an identical speaker and the test utterance is replay spoofed.

Table 3: Results of various architectures using the proposed monolithic E2E framework for the ISV task.

Table 4: Experimental results showing that the required discriminative information differs for SV and PAD (SID: speaker identification, PAD: replay spoofing detection, Int: integrated speaker verification).

Table 5: Results of the proposed modular approach for the ISV task.

Table 6: Comparison of the SV baseline, our proposed modular DNN, and other work using the official trials for the ISV task.