DriNet: Dynamic Backdoor Attack against Automatic Speech Recognition Models
Abstract
1. Introduction
- We explore a novel dynamic backdoor attack paradigm in the audio domain, a setting that has received little attention. In our threat model, dynamic triggers are used to attack automatic speech recognition systems, offering richer attack styles and greater stealth (a minimal poisoning sketch follows this list).
- We introduce a dynamic trigger generation method based on a generative adversarial network (GAN) trained jointly with a discriminative model, and demonstrate that backdoor attacks with dynamic triggers are effective.
- Experimental results on two benchmark datasets show the superiority of our method over the BadNets baseline. An extensive analysis further verifies the resistance of the proposed method to a state-of-the-art defense.
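To make the attack concrete before the formal treatment in Section 3, the following is a minimal poisoning sketch, not the authors' exact implementation. The assumption that triggers are (h × w) patches stamped onto mel-spectrogram features (consistent with the (1 × 5) to (3 × 5) trigger sizes reported in the experiments), the overwrite semantics, and the patch position are all illustrative choices.

```python
import torch

def poison_sample(spec: torch.Tensor, trigger: torch.Tensor,
                  target_label: int, row: int = 0, col: int = 0):
    """Illustrative poisoning step: stamp a generated trigger patch
    onto one utterance's feature matrix and relabel the sample with
    the attacker's target class.

    spec:    (n_mels, n_frames) mel-spectrogram of one utterance
    trigger: (h, w) patch produced by the trigger generator, in (0, 1)
    """
    poisoned = spec.clone()
    h, w = trigger.shape
    # Overwrite an h-by-w region. Because the generator is fed fresh
    # noise for every sample, the stamped patch varies from sample to
    # sample, unlike the fixed pattern used by BadNets.
    poisoned[row:row + h, col:col + w] = trigger
    return poisoned, target_label
```

Because the trigger differs per sample, defenses that try to reverse-engineer a single fixed pattern have a harder time reconstructing it.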
2. Related Work
2.1. Automatic Speech Recognition
2.2. Backdoor Attack
2.3. Defense for Backdoor Attack
3. Design of Dynamic Backdoor Attacks
3.1. Threat Model
3.2. Problem Formulation
3.3. Overview of Dynamic Backdoor Attack
3.4. Dynamic Trigger Generation
3.5. DriNet Algorithm
Algorithm 1: Joint training of the target victim model D and the trigger generation model G
Input: the training dataset Q; the number of epochs; the loss function; the uniform distribution function random(·); the target label l; the backdoor add function; the loss weight.
Output: D, G.
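The extracted listing above lost its pseudocode body, so the following is a minimal sketch of the joint training structure that the inputs and outputs of Algorithm 1 suggest. The optimizer (Adam, consistent with the reference cited below, but still an assumption here), the noise dimension of 100, the name add_trigger (standing in for the backdoor add function, e.g., a batched version of poison_sample sketched in the Introduction), and the weight alpha are all illustrative.

```python
import torch
import torch.nn.functional as F

def joint_train(D, G, loader, target_label, alpha=0.5,
                epochs=10, device="cpu"):
    """Sketch of Algorithm 1: optimize the victim model D and the
    trigger generator G together so that triggered inputs map to the
    target label while clean accuracy is preserved."""
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

    for _ in range(epochs):
        for x, y in loader:                  # x: a batch of features
            x, y = x.to(device), y.to(device)

            # Dynamic triggers: fresh uniform noise per batch element.
            z = torch.rand(x.size(0), 100, device=device)
            x_bd = add_trigger(x, G(z))      # the backdoor add function
            y_bd = torch.full_like(y, target_label)

            # Clean loss keeps ACC high; backdoor loss drives ASR up.
            loss = (F.cross_entropy(D(x), y)
                    + alpha * F.cross_entropy(D(x_bd), y_bd))

            opt_d.zero_grad()
            opt_g.zero_grad()
            loss.backward()
            opt_d.step()
            opt_g.step()
    return D, G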
3.6. Poisoning
4. Experiments
4.1. Experimental Settings
4.2. Evaluation Metrics
4.3. Results
4.4. Visualization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Abdullah, H.; Warren, K.; Bindschaedler, V.; Papernot, N.; Traynor, P. SoK: The faults in our ASRs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 24–27 May 2021; pp. 730–747.
- Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security; Chapman and Hall/CRC: New York, NY, USA, 2018; pp. 99–112.
- Sharif, M.; Bhagavatula, S.; Bauer, L.; Reiter, M.K. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1528–1540.
- Zhai, T.; Li, Y.; Zhang, Z.; Wu, B.; Jiang, Y.; Xia, S.T. Backdoor attack against speaker verification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2560–2564.
- Li, Y.; Wu, B.; Jiang, Y.; Li, Z.; Xia, S.T. Backdoor learning: A survey. arXiv 2020, arXiv:2007.08745.
- Gu, T.; Liu, K.; Dolan-Gavitt, B.; Garg, S. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access 2019, 7, 47230–47244.
- Liu, Y.; Ma, X.; Bailey, J.; Lu, F. Reflection backdoor: A natural backdoor attack on deep neural networks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 182–199.
- Salem, A.; Wen, R.; Backes, M.; Ma, S.; Zhang, Y. Dynamic backdoor attacks against machine learning models. arXiv 2020, arXiv:2003.03675.
- Zhao, S.; Ma, X.; Zheng, X.; Bailey, J.; Chen, J.; Jiang, Y.G. Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14443–14452.
- Yao, Y.; Li, H.; Zheng, H.; Zhao, B.Y. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 2041–2055.
- Turner, A.; Tsipras, D.; Madry, A. Clean-Label Backdoor Attacks. Available online: https://people.csail.mit.edu/madry/lab/ (accessed on 20 May 2022).
- Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; Zhao, B.Y. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 707–723.
- Gao, Y.; Xu, C.; Wang, D.; Chen, S.; Ranasinghe, D.C.; Nepal, S. STRIP: A defence against Trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019; pp. 113–125.
- Gao, Y.; Doan, B.G.; Zhang, Z.; Ma, S.; Zhang, J.; Fu, A.; Nepal, S.; Kim, H. Backdoor attacks and countermeasures on deep learning: A comprehensive review. arXiv 2020, arXiv:2007.10760.
- O’Shaughnessy, D. Automatic speech recognition: History, methods and challenges. Pattern Recognit. 2008, 41, 2965–2979.
- Liu, Y.; Ma, S.; Aafer, Y.; Lee, W.; Zhai, J.; Wang, W.; Zhang, X. Trojaning attack on neural networks. In Proceedings of the 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, CA, USA, 18–21 February 2018.
- Nguyen, T.A.; Tran, A. Input-aware dynamic backdoor attack. Adv. Neural Inf. Process. Syst. 2020, 33, 3454–3464.
- Kong, Y.; Zhang, J. Adversarial audio: A new information hiding method and backdoor for DNN-based speech recognition models. arXiv 2019, arXiv:1904.03829.
- Koffas, S.; Xu, J.; Conti, M.; Picek, S. Can you hear it? Backdoor attacks via ultrasonic triggers. arXiv 2021, arXiv:2107.14569.
- Aghakhani, H.; Schönherr, L.; Eisenhofer, T.; Kolossa, D.; Holz, T.; Kruegel, C.; Vigna, G. VenoMave: Targeted poisoning against speech recognition. arXiv 2020, arXiv:2010.10682.
- Li, M.; Wang, X.; Huo, D.; Wang, H.; Liu, C.; Wang, Y.; Wang, Y.; Xu, Z. A novel Trojan attack against co-learning based ASR DNN system. In Proceedings of the 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China, 5–7 May 2021; pp. 907–912.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9.
- Becker, S.; Ackermann, M.; Lapuschkin, S.; Müller, K.R.; Samek, W. Interpreting and explaining deep neural networks for classification of audio signals. arXiv 2018, arXiv:1807.03418.
- Warden, P. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv 2018, arXiv:1804.03209.
- Jati, A.; Hsu, C.C.; Pal, M.; Peri, R.; AbdAlmageed, W.; Narayanan, S. Adversarial attack and defense strategies for deep speaker recognition systems. Comput. Speech Lang. 2021, 68, 101199.
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333.
- Pedersen, P. The mel scale. J. Music Theory 1965, 9, 295–308.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 1–12.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Hampel, F.R. The influence curve and its role in robust estimation. J. Am. Stat. Assoc. 1974, 69, 383–393.
| Layer Type | Input | Filter | Stride | Activation |
|---|---|---|---|---|
| Conv1d | 1 | | | ReLU |
| FC | n | - | - | ReLU |
| FC | 64 | - | - | ReLU |
| Dropout | - | - | - | - |
| FC | 128 | - | - | ReLU |
| Dropout | - | - | - | - |
| FC | 128 | - | - | ReLU |
| Dropout | - | - | - | - |
| Reshape | - | - | - | sigmoid |
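The layer table above (which appears to describe the trigger generation model) omits the convolution's filter size and stride, so a faithful reconstruction is not possible from this section alone. Below is a minimal PyTorch sketch that follows the listed layer order; the channel count (8), kernel size (3), stride (1), dropout rate, noise length, and trigger shape are all assumptions.

```python
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    """Sketch following the layer table: Conv1d -> FC(n) -> FC(64) ->
    FC(128) -> FC(128) -> reshape with sigmoid. Kernel size, stride,
    channel count, dropout rate, and shapes are assumptions."""
    def __init__(self, noise_len=100, trigger_shape=(3, 5), p=0.5):
        super().__init__()
        h, w = trigger_shape
        self.conv = nn.Conv1d(1, 8, kernel_size=3, stride=1)  # assumed
        n = 8 * (noise_len - 2)      # flattened Conv1d output size
        self.mlp = nn.Sequential(
            nn.Linear(n, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, h * w), nn.ReLU(), nn.Dropout(p),
        )
        self.trigger_shape = trigger_shape

    def forward(self, z):            # z: (batch, noise_len)
        x = torch.relu(self.conv(z.unsqueeze(1)))
        x = self.mlp(x.flatten(1))
        # Sigmoid bounds trigger values, then reshape to a 2-D patch.
        return torch.sigmoid(x).view(-1, *self.trigger_shape)
```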
| Dataset | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AudioMNIST | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 30,000 |
| Speech Commands | 4052 | 3890 | 3880 | 3727 | 3728 | 4052 | 3860 | 3998 | 3787 | 3934 | 38,908 |
**AudioMNIST**

| Model | Poisoning Rate | ACC (1 × 5) | ASR (1 × 5) | ACC (2 × 5) | ASR (2 × 5) | ACC (3 × 5) | ASR (3 × 5) |
|---|---|---|---|---|---|---|---|
| 1D CNN | 0.005 | 0.998 | 0.697 | 0.998 | 0.758 | 1 | 0.864 |
| 1D CNN | 0.01 | 0.999 | 0.87 | 0.998 | 0.848 | 1 | 0.984 |
| 1D CNN | 0.03 | 0.999 | 0.983 | 0.999 | 0.988 | 0.999 | 0.989 |
| 1D CNN | 0.05 | 0.998 | 0.984 | 0.999 | 0.994 | 0.999 | 0.999 |
| 1D CNN | 0.1 | 0.998 | 0.987 | 0.999 | 0.997 | 0.999 | 0.995 |
| 1D CNN | 0.2 | 0.998 | 0.996 | 0.998 | 0.997 | 0.998 | 0.997 |
| X-vector | 0.005 | 0.997 | 0.676 | 0.997 | 0.516 | 0.997 | 0.633 |
| X-vector | 0.01 | 0.998 | 0.857 | 0.999 | 0.968 | 0.999 | 0.926 |
| X-vector | 0.03 | 0.998 | 0.983 | 0.998 | 0.957 | 0.998 | 0.977 |
| X-vector | 0.05 | 0.998 | 0.987 | 0.998 | 0.984 | 0.998 | 0.994 |
| X-vector | 0.1 | 0.997 | 0.985 | 0.999 | 0.999 | 0.998 | 0.997 |
| X-vector | 0.2 | 0.998 | 0.994 | 0.999 | 0.999 | 0.999 | 0.997 |
**Speech Commands**

| Model | Poisoning Rate | ACC (1 × 5) | ASR (1 × 5) | ACC (2 × 5) | ASR (2 × 5) | ACC (3 × 5) | ASR (3 × 5) |
|---|---|---|---|---|---|---|---|
| 1D CNN | 0.03 | 0.927 | 0.211 | 0.936 | 0.341 | 0.946 | 0.455 |
| 1D CNN | 0.05 | 0.917 | 0.411 | 0.926 | 0.467 | 0.924 | 0.541 |
| 1D CNN | 0.1 | 0.877 | 0.484 | 0.893 | 0.559 | 0.906 | 0.629 |
| 1D CNN | 0.2 | 0.834 | 0.617 | 0.852 | 0.662 | 0.871 | 0.722 |
| 1D CNN | 0.3 | 0.8 | 0.68 | 0.825 | 0.753 | 0.847 | 0.778 |
| 1D CNN | 0.4 | 0.79 | 0.762 | 0.815 | 0.81 | 0.845 | 0.824 |
| 1D CNN | 0.5 | 0.782 | 0.812 | 0.822 | 0.842 | 0.844 | 0.866 |
| X-vector | 0.03 | 0.931 | 0.105 | 0.941 | 0.342 | 0.949 | 0.522 |
| X-vector | 0.05 | 0.915 | 0.287 | 0.932 | 0.52 | 0.939 | 0.574 |
| X-vector | 0.1 | 0.891 | 0.466 | 0.905 | 0.627 | 0.92 | 0.705 |
| X-vector | 0.2 | 0.835 | 0.636 | 0.875 | 0.755 | 0.892 | 0.756 |
| X-vector | 0.3 | 0.812 | 0.727 | 0.851 | 0.797 | 0.863 | 0.807 |
| X-vector | 0.4 | 0.797 | 0.795 | 0.841 | 0.847 | 0.862 | 0.868 |
| X-vector | 0.5 | 0.811 | 0.861 | 0.84 | 0.888 | 0.861 | 0.898 |
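Throughout these tables, ACC is accuracy on the clean test set and ASR is the fraction of triggered inputs classified as the attacker's target label. Below is a minimal sketch of how the two metrics can be computed; the loop structure and the exclusion of samples already belonging to the target class are conventional choices, not taken from the paper.

```python
import torch

@torch.no_grad()
def evaluate(model, clean_loader, poisoned_loader, target_label):
    """ACC: fraction of clean samples classified correctly.
    ASR: fraction of triggered samples predicted as the target label."""
    model.eval()

    correct = total = 0
    for x, y in clean_loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()

    hits = count = 0
    for x_bd, y in poisoned_loader:
        keep = y != target_label          # skip already-target samples
        preds = model(x_bd[keep]).argmax(dim=1)
        hits += (preds == target_label).sum().item()
        count += keep.sum().item()

    return correct / total, hits / count
```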
| Attack Method | Victim Model | ACC (AudioMNIST) | ASR (AudioMNIST) | ACC (Speech Commands) | ASR (Speech Commands) | Evades Neural Cleanse |
|---|---|---|---|---|---|---|
| BadNets | 1D CNN | 0.999 | 0.956 | 0.912 | 0.677 | × |
| BadNets | X-vector | 0.999 | 0.949 | 0.915 | 0.658 | × |
| DriNet | 1D CNN | 0.999 | 0.984 | 0.906 | 0.629 | √ |
| DriNet | X-vector | 0.999 | 0.945 | 0.920 | 0.705 | √ |
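On the Neural Cleanse column: the defense reverse-engineers a candidate trigger for every class and flags a class whose trigger has an abnormally small L1 norm, using the Median Absolute Deviation from robust statistics (cf. the Hampel reference above). A minimal sketch of that anomaly test, assuming the per-class reversed-trigger norms have already been computed:

```python
import torch

def neural_cleanse_anomaly(l1_norms: torch.Tensor, threshold: float = 2.0):
    """MAD-based outlier test used by Neural Cleanse: a class whose
    reversed-trigger L1 norm lies more than `threshold` scaled MADs
    below the median is flagged as likely backdoored."""
    med = l1_norms.median()
    # 1.4826 makes the MAD a consistent estimator of a Gaussian std.
    mad = 1.4826 * (l1_norms - med).abs().median()
    index = (l1_norms - med).abs() / mad
    flagged = (index > threshold) & (l1_norms < med)
    return index, flagged
```

Read this way, a √ for DriNet indicates its per-sample triggers withstand the test, while the fixed BadNets patch (×) does not.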