3.1. Speaker Verification Systems
Speaker verification (SV) systems have been widely used to identify a person through their voice. In our work, we focus on score-based SV systems, the category to which most state-of-the-art SV systems belong. A score-based SV system operates in two phases, speaker enrollment and speaker recognition, as shown in Figure 1.
In the speaker enrollment phase, a speaker needs to provide an identifier and their audio clips. The SV system transforms the speaker's voice into a fixed-length, low-dimensional vector called a speaker embedding. The speaker embedding represents the features of a speaker's voice and is used to calculate the level of similarity between two audio clips. Different SV systems use different approaches to obtain the speaker embedding; state-of-the-art SV systems include i-vector [14], GMM [13], d-vector [29], and x-vector [30]. In this work, we focus on GMM and i-vector, since they have been widely used in practice and are often taken as baselines for comparison. Here, we use the notation $\mathbf{e}_R$ to refer to the speaker embedding vector of the registered speaker R.
Besides obtaining the speaker embedding, the SV system attempts to find a proper threshold for this speaker during the enrollment phase. Such a threshold is a key consideration for a score-based SV system. To understand the importance of the threshold, we first look at the recognition phase. As shown in Figure 1, when a test audio clip is provided to the SV system, it extracts the speaker embedding of this audio, referred to as $\mathbf{e}_T$. Then, the SV system calculates a similarity score between vector $\mathbf{e}_T$ and vector $\mathbf{e}_R$. A higher score reflects more similarity between the two speaker embedding vectors. Finally, the similarity score is compared with the threshold. If the score is higher than or equal to the threshold, the system accepts the test audio clip and treats the speaker as the registered user. Otherwise, the system rejects the speaker.
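To make the decision rule concrete, the following is a minimal sketch of the recognition step, assuming numpy arrays as speaker embeddings and cosine similarity as the score function; the function names and the choice of cosine scoring are illustrative, since different SV systems use different scoring back-ends.

```python
import numpy as np

def cosine_score(emb_test: np.ndarray, emb_enrolled: np.ndarray) -> float:
    """Similarity score between two speaker embeddings (higher means more similar)."""
    return float(np.dot(emb_test, emb_enrolled) /
                 (np.linalg.norm(emb_test) * np.linalg.norm(emb_enrolled)))

def verify(emb_test: np.ndarray, emb_enrolled: np.ndarray, threshold: float) -> bool:
    """Accept the test audio if and only if its score reaches the enrollment threshold."""
    return cosine_score(emb_test, emb_enrolled) >= threshold
```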
There are two basic error cases for an SV system: (1) accepting an illegal speaker, i.e., a speaker who is not the registered user, and (2) rejecting the registered speaker. For these two cases, we define two measures to evaluate the performance of an SV system, namely the false acceptance rate (FAR) and the false rejection rate (FRR), as follows:

$$\mathrm{FAR} = \frac{\text{number of accepted audio clips from illegal speakers}}{\text{total number of audio clips from illegal speakers}}, \tag{1}$$

$$\mathrm{FRR} = \frac{\text{number of rejected audio clips from the registered speaker}}{\text{total number of audio clips from the registered speaker}}. \tag{2}$$
For an ideal SV system, both FAR and FRR are 0. However, in a real SV system, there is a tradeoff between FAR and FRR, which makes it hard to keep both of them at 0. In general, when one decreases, the other increases. Intuitively, when the threshold increases, it becomes more difficult for an audio clip to be accepted; as a result, FAR decreases while FRR increases. A common practice is to choose the threshold at which FAR and FRR take the same value. Such a value is called the equal error rate (EER) [31].
To find the EER and the corresponding threshold, both registered speaker audio clips and illegal speaker audio clips need to be provided during the enrollment phase. An SV system calculates the similarity scores for all provided audio clips and then finds the threshold that can lead to the EER.
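As an illustration, the following is a minimal sketch of how the EER threshold could be found during enrollment, assuming two numpy arrays of similarity scores (one from the registered speaker, one from illegal speakers) are already available; the brute-force sweep over candidate thresholds is for clarity, not efficiency.

```python
import numpy as np

def eer_threshold(genuine_scores: np.ndarray, impostor_scores: np.ndarray):
    """Return the threshold where FAR and FRR are closest, along with the resulting EER."""
    candidates = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_t, best_gap, eer = None, np.inf, None
    for t in candidates:
        far = np.mean(impostor_scores >= t)  # illegal audio clips falsely accepted
        frr = np.mean(genuine_scores < t)    # registered-speaker audio clips falsely rejected
        if abs(far - frr) < best_gap:
            best_t, best_gap, eer = t, abs(far - frr), (far + frr) / 2
    return best_t, eer
```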
3.2. Adversarial Attacks against Speaker Verification Systems
The study of adversarial attacks is rooted in research on applying machine learning to security-sensitive applications. In their original work, Biggio et al. pointed out that a well-designed adversarial attack can evade malware detection in PDF files [32]. Moreover, Szegedy et al. and Goodfellow et al. demonstrated that deep learning is particularly vulnerable to adversarial example attacks [8,33]. For example, after very small perturbations are added, an image of a panda can be recognized as a gibbon with 99.3% confidence by a popular deep-learning-based classifier [33]. Later, researchers realized that adversarial attacks can be applied to many different domains, such as cyber-physical systems [34], medical IoT devices [35], and industrial soft sensors [36]. Interested readers can refer to [9] for a comprehensive survey on adversarial attacks and defenses.
Adversarial attacks and defenses have not yet been comprehensively and systematically studied in the field of SV systems. In the context of an SV system, an adversarial attack attempts to make the SV system falsely accept a well-designed illegal audio clip, which is called an adversarial audio. Specifically, the adversarial audio is an original illegal clean audio with small perturbations added, often barely perceptible by humans, yet these perturbations lead the SV system to falsely accept the audio. Let x be an original audio from an illegal user and p be the perturbation vector with the same length as x. Then, the adversarial audio, $\tilde{x}$, can be written as

$$\tilde{x} = x + p. \tag{3}$$

If p is designed cleverly, humans may notice little or no difference between x and $\tilde{x}$, but an SV system may be tricked into rejecting x while falsely accepting $\tilde{x}$.
To make sure that the adversarial audio is not noticed or detected by humans, the perturbations are usually very small and constrained by a perturbation threshold, $\epsilon$. That is,

$$\|p\|_\infty \le \epsilon. \tag{4}$$

Choosing the value of $\epsilon$ is an important consideration for adversarial attacks. A larger value of $\epsilon$ makes the attack easier to succeed, but also makes the perturbations more perceptible to humans.
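The following is a minimal sketch of Equations (3) and (4) in code, assuming the audio signals are numpy arrays of samples; the default epsilon value is only an example.

```python
import numpy as np

def make_adversarial(x: np.ndarray, p: np.ndarray, eps: float = 0.002) -> np.ndarray:
    """Form the adversarial audio x_tilde = x + p while enforcing ||p||_inf <= eps."""
    p_bounded = np.clip(p, -eps, eps)   # Equation (4): every sample of p stays within [-eps, eps]
    return x + p_bounded                # Equation (3)
```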
In our work, we study the FakeBob attack [7] and how to defend against it, since it is the state-of-the-art adversarial attack against SV systems, including GMM, i-vector, and x-vector. Specifically, the FakeBob attack is a black-box attack that does not need to know the internal structure of an SV system. Moreover, as shown in [7], FakeBob achieves at least a 99% targeted attack success rate (ASR) on both open-source and commercial SV systems, where ASR is defined as follows:

$$\mathrm{ASR} = \frac{\text{number of adversarial audio clips accepted by the SV system}}{\text{total number of adversarial audio clips}}. \tag{5}$$

It is noted that both FAR and ASR consider audio clips from illegal users. However, FAR is measured on audio clips without adversarial perturbations, whereas ASR is measured on audio clips with perturbations designed by attackers.
Figure 2 shows the basic process of the FakeBob adversarial attack. Basically, FakeBob applies the basic iterative method (BIM) [37] and the natural evolution strategy (NES) [38] to generate the adversarial audio. The attack takes multiple iterations to produce the final adversarial audio $\tilde{x}$, with the goal of minimizing the following loss (objective) function

$$L(y) = \max\{\theta - S(y),\ 0\}, \tag{6}$$

where y is an input audio, $\theta$ is the threshold of the SV system, and $S(\cdot)$ is the score function that calculates the score of an input audio for SV. FakeBob solves the optimization problem by estimating the threshold $\theta$ and iteratively finding the input audio that reduces $L(y)$, through the method of gradient descent over the input audio. Specifically, it applies the following gradient descent function

$$\nabla L(y) = \frac{\partial L(y)}{\partial y}, \tag{7}$$

which, since the SV system is a black box, is estimated with NES rather than computed exactly. Note that this gradient descent differs from the back-propagation widely used in deep learning: the differentiation is with respect to the input audio, instead of the weights of the machine learning model.
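Because the score function is only available as a black box, the gradient in Equation (7) has to be estimated from score queries. The following is a minimal sketch of an NES-style estimator over the input audio; the sampling parameters and the use of antithetic sampling are illustrative assumptions rather than the exact settings of FakeBob.

```python
import numpy as np

def nes_gradient(loss_fn, y: np.ndarray, sigma: float = 1e-3, n_samples: int = 50) -> np.ndarray:
    """Estimate dL/dy by probing the black-box loss with Gaussian perturbations (NES)."""
    grad = np.zeros_like(y)
    for _ in range(n_samples):
        u = np.random.randn(*y.shape)
        # Antithetic sampling: query the loss at y + sigma*u and y - sigma*u.
        grad += (loss_fn(y + sigma * u) - loss_fn(y - sigma * u)) * u
    return grad / (2.0 * n_samples * sigma)
```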
Define a sign function $\mathrm{sign}(\cdot)$ in the following way: for each element $y_i$ in a vector y, the sign function returns the sign of the value of that element, i.e.,

$$\mathrm{sign}(y_i) = \begin{cases} 1, & y_i > 0, \\ 0, & y_i = 0, \\ -1, & y_i < 0. \end{cases} \tag{8}$$
Moreover, assume $x_i$ ($i = 1, 2, \ldots, n$) is a sample in the original clean audio (i.e., x) from an illegal speaker, $\tilde{x}_i^{(k)}$ is the corresponding sample in the adversarial audio at the k-th iteration (i.e., $\tilde{x}^{(k)}$), and $\epsilon$ is the perturbation threshold shown in Equation (4). Based on the constraint in Equation (4), a clip function is defined as follows

$$\mathrm{clip}\left(\tilde{x}_i^{(k)}\right) = \min\left\{x_i + \epsilon,\ \max\left\{x_i - \epsilon,\ \tilde{x}_i^{(k)}\right\}\right\}. \tag{9}$$
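In code, Equation (9) is simply an element-wise clamp of the adversarial audio to an $\epsilon$-neighborhood of the clean audio; a one-line numpy sketch is given below.

```python
import numpy as np

def clip_to_eps_ball(x_adv: np.ndarray, x_clean: np.ndarray, eps: float) -> np.ndarray:
    """Keep every sample of the adversarial audio within eps of the clean audio (Equation (9))."""
    return np.clip(x_adv, x_clean - eps, x_clean + eps)
```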
Using the above functions, FakeBob updates the input audio through the following iteration

$$\tilde{x}^{(k+1)} = \mathrm{clip}\left(\tilde{x}^{(k)} - \eta \cdot \mathrm{sign}\left(\nabla L\left(\tilde{x}^{(k)}\right)\right)\right), \tag{10}$$

where $\eta$ is the learning rate and the clip function is applied element-wise. The FakeBob attack is summarized in Algorithm 1.
To better understand the FakeBob attack, we look into an example of an adversarial audio in both the time domain and the Mel spectrogram. Specifically, applying FakeBob with a perturbation threshold of 0.002, we obtained an adversarial audio that is falsely accepted by the GMM SV system. Figure 3 shows the time waveform of the adversarial audio and the perturbations (i.e., p in Equation (3)) in both the time domain and the Mel spectrogram. It can be observed from Figure 3b that the perturbations used in the FakeBob attack are very small, i.e., $|p_i| \le \epsilon = 0.002$ for every sample. Moreover, the perturbations resemble white noise: they spread over the entire Mel spectrogram with a similar intensity, as shown in Figure 3c. On the other hand, as the FakeBob attack in Algorithm 1 shows, these perturbations are not random, but are intentionally designed to fool the SV system.
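A similar inspection can be reproduced with librosa; the sketch below loads a clean/adversarial pair (the file names are placeholders), checks the size of the perturbation, and computes its Mel spectrogram. It assumes the two clips have the same length and sampling rate.

```python
import librosa
import numpy as np

# Placeholder file names; any clean/adversarial audio pair of equal length works.
x, sr = librosa.load("clean.wav", sr=16000)
x_adv, _ = librosa.load("adversarial.wav", sr=16000)

p = x_adv - x                                  # perturbation, Equation (3)
print("max |p_i| =", np.max(np.abs(p)))        # should not exceed eps (here, 0.002)

# Mel spectrogram of the perturbation (in dB) for visual inspection, as in Figure 3c
mel = librosa.feature.melspectrogram(y=p, sr=sr)
mel_db = librosa.power_to_db(mel, ref=np.max)
```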
Algorithm 1 FakeBob Attack
1: Input: an audio signal array and the threshold of the SV system
2: Output: an adversarial audio
3: Require: threshold of the targeted SV system $\theta$, audio signal array A, maximum iteration m, score function S, gradient descent function $\nabla$, clip function $\mathrm{clip}(\cdot)$, learning rate $\eta$, and sign function $\mathrm{sign}(\cdot)$
4: $\tilde{x}^{(0)} \leftarrow A$
5: $k \leftarrow 0$
6: for $k = 0$; $k < m$; $k{+}{+}$ do
7:   $s \leftarrow S(\tilde{x}^{(k)})$
8:   if $s \ge \theta$ then
9:     return $\tilde{x}^{(k)}$
10:  end if
11:  $\tilde{x}^{(k+1)} \leftarrow \mathrm{clip}\left(\tilde{x}^{(k)} - \eta \cdot \mathrm{sign}\left(\nabla L\left(\tilde{x}^{(k)}\right)\right)\right)$
12: end for
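To tie the pieces together, the following is a minimal Python sketch of the iterative procedure in Algorithm 1. It reuses the nes_gradient and clip_to_eps_ball sketches from above, assumes a black-box score_fn that returns the SV similarity score of an audio clip, and uses illustrative parameter values; it is a simplified illustration, not the authors' implementation.

```python
import numpy as np

def fakebob_style_attack(x: np.ndarray, score_fn, theta: float,
                         eps: float = 0.002, lr: float = 5e-4, max_iter: int = 1000):
    """Perturb x within an eps-ball until the SV score reaches the (estimated) threshold theta.

    score_fn(audio) -> similarity score against the enrolled speaker (black box).
    nes_gradient and clip_to_eps_ball are the helper sketches shown earlier.
    """
    loss_fn = lambda y: max(theta - score_fn(y), 0.0)        # Equation (6)
    x_adv = x.copy()
    for _ in range(max_iter):
        if score_fn(x_adv) >= theta:                          # accepted: attack succeeded
            return x_adv
        grad = nes_gradient(loss_fn, x_adv)                   # Equation (7), black-box estimate
        x_adv = x_adv - lr * np.sign(grad)                    # descent step from Equation (10)
        x_adv = clip_to_eps_ball(x_adv, x, eps)               # Equations (9) and (4)
    return None                                               # no success within the budget
```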