Adversarial Attack and Defense Strategies of Speaker Recognition Systems: A Survey
Abstract
:1. Introduction
- An overview of the existing works on SRS is presented to introduce the latest advances in adversarial examples against SRSs, and from the perspectives of example generation and example evaluation, we classify the adversarial examples by such indicators as the attack task, the perturbation target, the perturbation constraint strategies, and attack effect evaluation.
- We review and classify the existing attack and defense methods from three aspects: adversarial training, attack detection, and input refactoring, and measure the effectiveness of these methods by fidelity and robustness.
2. Background
2.1. Overview of SRS
- Training: over ten thousand audio clips from large amounts of speakers are used to train the speaker embedding module and obtain human voice feature distributions, regardless of single speakers;
- Enrolling: the enrolled speaker utterance is mapped onto a unique labeled speaker embedding through the speaker embedding module, and this high-dimensional feature vector is this speaker’s unique identity;
- Verification: the model scores the utterance of an unknown speaker by extracting high-dimensional feature vectors from the embedding module. The scoring module assesses the similarity between the recorded embedding and the speaker embedding, and the score and decision module is based on to judge whether the speaker is legitimate.
2.2. SR Task
2.2.1. CSI Task
2.2.2. OSI Task
2.2.3. SV Task
2.3. Victim Models
2.4. Datasets
- TIMIT: The standard dataset in the field of speech recognition is a relatively small dataset that enables the training and testing of models in a short period of time, and its database is manually annotated down to the phoneme, with speakers from all parts of the United States, and provides detailed speaker information, such as ethnicity, education, and even height.
- NTIMIT: The dataset that puts the audio data in TIMIT on a different telephone line for transmission and then reception is a dataset created to implement voice recognition in the telephone network.
- Aishell: Aishell-1 is the first large data volume Chinese dataset, with 178 h of speech, 400 speakers, 340 people in the training set, 20 people in the test set, and 40 people in the validation set, each of whom speaks about 300 sentences. Aishell-2 expands the data volume to 1000 h of speech, with 1991 speakers, each of whom speaks 500 sentences. The words spoken by each person may be repeated.
- LibriSpeech: The dataset is a large corpus containing approximately 1000 h of English speech. The data come from the audiobook recordings read by different readers of the LibriVox project, organized according to the sections of the audiobooks. It is segmented and correctly aligned.
- Voxceleb1,Voxceleb2: Two speaker recognition datasets without intersection, both of which are obtained from open source video sites captured by a set of fully automated programs based on computer vision technology development. They differ in size, with VoxCeleb2 compensating for the lack of ethnic diversity in VoxCeleb1 by being five times larger than VoxCeleb1 in terms of data size.
- YOHO: A speech dataset collected in an office environment that is text dependent, where the speaker speaks in a restricted textual combination.
- CSTR VCTK: A dataset including noisy and non-noisy speech with a sampling rate of 48 kHz and in which the speaker is accented.
3. Adversarial Attack
3.1. Overview of Adversarial Attack
- must be within a proper range such that the waveform can be recovered into an audio;
- must be as small as possible;
- SRSs will identify as the special target specified by the attacker beforehand (it can also be any other random target, but this is not meaningful).
3.2. Adversarial Task
- CSI: As mentioned in Section 2.2.1, close-set speaker recognition is a simple classification task and involves no thresholds. Thus, how to make the confidence coefficient of the decoded adversarial audio skew toward the target label is the key to attacks against the CSI models. To transfer adversarial attack algorithms from the field of image processing is a good choice.
- OSI: Different from CSI models, the OSI model uses the back-end scoring module to obtain a decision threshold, which is relied on to make the final judgments. If the internal structure and threshold are known, the perturbation can be scaled up to increase the attack intensity; if the internal parameters of the model are unavailable, how to identify the model decision threshold is a challenge in OSI attack tasks.
- SV: In adversarial attacks against SV models, we need only simulate the voice features of a single speaker to make the model score bigger than the threshold. Attacks against SV models are easier than attacks against OSI systems; however, for models whose internal parameters are unknown, the internal structure and decision threshold of the model should be considered.
3.3. Adversarial Knowledge
3.3.1. White-Box
- (1)
- Gradient-based attack.
- (2)
- Optimization-based attack
3.3.2. Grey-Box Attack
3.3.3. Black-Box Attack
3.4. Practicality
3.4.1. Over-the-Air
3.4.2. Commercial SRSs
3.5. Universal Attack
3.5.1. Data-Agnostic
3.5.2. Model-Agnostic
3.6. Perturbation Object
3.6.1. Time-Domain Perturbation
3.6.2. Frequency-Domain Perturbation
3.7. Perturbation Constraint
3.7.1. Perturbation Measurement
3.7.2. Psychoacoustic Masking
3.8. Attack Metrics
3.8.1. Effectiveness
3.8.2. Imperceptibility
4. Adversarial Defense
4.1. Adversarial Training
4.2. Attack Detection
4.3. Input Refactoring
4.3.1. Denoising
4.3.2. Noise Addition
4.4. Defense Metrics
5. Discussion
5.1. General Observations of Adversarial Audio
5.2. Challenges
5.3. Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Gadekallu, T.R.; Manoj, M.K.; Kumar, N.; Hakak, S.; Bhattacharya, S. Blockchain-Based Attack Detection on Machine Learning Algorithms for IoT-Based e-Health Applications. IEEE Internet Things Mag. 2021, 4, 30–33. [Google Scholar] [CrossRef]
- Gu, Z.; Li, H.; Khan, S.; Deng, L.; Du, X.; Guizani, M.; Tian, Z. IEPSBP: A Cost-efficient Image Encryption Algorithm based on Parallel Chaotic System for Green IoT. IEEE Trans. Green Commun. Netw. 2021, 6, 89–106. [Google Scholar] [CrossRef]
- Gu, Z.; Wang, L.; Chen, X.; Tang, Y.; Wang, X.; Du, X.; Guizani, M.; Tian, Z. Epidemic risk assessment by a novel communication station based method. IEEE Trans. Netw. Sci. Eng. 2021, 9, 332–344. [Google Scholar] [CrossRef] [PubMed]
- Javed, A.R.; Ur Rehman, S.; Khan, M.U.; Alazab, M.; Reddy, T. CANintelliIDS: Detecting in-vehicle intrusion attacks on a controller area network using CNN and attention-based GRU. IEEE Trans. Netw. Sci. Eng. 2021, 8, 1456–1466. [Google Scholar] [CrossRef]
- Shafiq, M.; Tian, Z.; Bashir, A.K.; Jolfaei, A.; Yu, X. Data mining and machine learning methods for sustainable smart cities traffic classification: A survey. Sustain. Cities Soc. 2020, 60, 102177. [Google Scholar] [CrossRef]
- Shafiq, M.; Tian, Z.; Bashir, A.K.; Du, X.; Guizani, M. IoT malicious traffic identification using wrapper-based feature selection mechanisms. Comput. Secur. 2020, 94, 101863. [Google Scholar] [CrossRef]
- Farokhi, S.; Flusser, J.; Sheikh, U.U. Near, infrared. Face recognition: A literature survey. Comput. Sci. Rev. 2003, 21, 1–7. [Google Scholar] [CrossRef]
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. x-vectors: Robust dnn embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
- Pelecanos, J.; Wang, Q.; Moreno, I.L. Dr-Vectors: Decision residual networks and an improved loss for speaker recognition. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021; pp. 4603–4607. [Google Scholar]
- Hautamäki, R.G.; Kinnunen, T.; Hautamäki, V.; Leino, T.; Laukkanen, A.M. I-vectors meet imitators: On vulnerability of speaker verification systems against voice mimicry. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), Lyon, France, 25–29 August 2013; pp. 930–934. [Google Scholar]
- Godoy, E.; Rosec, O.; Chonavel, T. Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora. Trans. Speech Audio Process. 2011, 4, 1313–1323. [Google Scholar] [CrossRef]
- Wu, Z.; Virtanen, T.; Kinnunen, T.; Chng, E.; Li, H. Exemplar-based unit selection for voice conversion utilizing temporal information. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013), Lyon, France, 25–29 August 2013; pp. 3057–3061. [Google Scholar]
- Ze, H.; Senior, A.; Schuster, M. Statistical parametric speech synthesis using deep neural networks. In Proceedings of the 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 7962–7966. [Google Scholar]
- Zen, H.; Tokuda, K.; Black, A.W. Statistical parametric speech synthesis. Speech Commun. 2009, 1, 1039–1064. [Google Scholar] [CrossRef]
- Jia, Y.; Zhang, Y.; Weiss, R.; Wang, Q.; Shen, J.; Ren, F.; Nguyen, P.; Pang, R.; Lopez, M.I.; Wu, Y. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 4485–4495. [Google Scholar]
- Wang, Z.F.; Wei, G.; He, Q.H. Channel pattern noise based playback attack detection algorithm for speaker recognition. In Proceedings of the 2011 International Conference on Machine Learning and Cybernetics (ICMLC), Guilin, China, 10–13 July 2011; pp. 1708–1713. [Google Scholar]
- Syverson, P. A taxonomy of replay attacks [cryptographic protocols]. In Proceedings of the Computer Security Foundations Workshop. Franconia, NH, USA, 14–16 June 1994; pp. 187–191. [Google Scholar]
- Villalba, J.; Lleida, E. Preventing replay attacks on speaker verification systems. In Proceedings of the International Carnahan Conference on Security Technology (ICCST), Barcelona, Spain, 18–21 October 2011; pp. 1–8. [Google Scholar]
- Yoon, S.H.; Koh, M.S.; Park, J.H.; Yu, H.J. A new replay attack against automatic speaker verification systems. IEEE Access 2020, 8, 36080–36088. [Google Scholar] [CrossRef]
- Alegre, F.; Janicki, A.; Evans, N. Re-assessing the threat of replay spoofing attacks against automatic speaker verification. In Proceedings of the 13th International Conference of the Biometrics Special Interest Group, Darmstadt, Germany, 10–12 September 2014; pp. 1–6. [Google Scholar]
- Wu, Z.; Yamagishi, J.; Kinnunen, T.; Hanilçi, C.; Sahidullah, M.; Sizov, A.; Evans, N.; Todisco, M.; Delgado, H. ASVspoof: The automatic speaker verification spoofing and countermeasures challenge. IEEE J. Sel. Top. Signal Process. 2017, 4, 588–604. [Google Scholar] [CrossRef] [Green Version]
- Nautsch, A.; Wang, X.; Evans, N.; Kinnunen, T.H.; Vestman, V.; Todisco, M.; Delgado, H.; Sahidullah, M.; Yamagishi, J.; Lee, K.A. ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 252–265. [Google Scholar] [CrossRef]
- Saha, A.; Subramanya, A.; Pirsiavash, H. Hidden trigger backdoor attacks. In Proceedings of the AAAI conference on artificial intelligence, New York, NY, USA, 7–12 February 2020; pp. 11957–11965. [Google Scholar]
- Chen, X.; Liu, C.; Li, B.; Lu, K.; Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv 2017, arXiv:1712.05526. [Google Scholar]
- Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; Zhao, B.Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 707–723. [Google Scholar]
- Zhai, T.; Li, Y.; Zhang, Z.; Wu, B.; Jiang, Y.; Xia, S.T. Backdoor attack against speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2560–2564. [Google Scholar]
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 39–57. [Google Scholar]
- Dong, Y.; Fu, Q.A.; Yang, X.; Pang, T.; Su, H.; Xiao, Z.; Zhu, J. Benchmarking adversarial robustness on image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 321–331. [Google Scholar]
- Wang, J. Adversarial examples in physical world. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), Montreal, ON, Canada, 19–27 August 2021; pp. 4925–4926. [Google Scholar]
- Zhang, W.E.; Sheng, Q.Z.; Alhazmi, A.; Li, C. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–41. [Google Scholar] [CrossRef] [Green Version]
- Zhu, B.; Gu, Z.; Qian, Y.; Lau, F.; Tian, Z. Leveraging Transferability and Improved Beam Search in Textual Adversarial Attacks. Neurocomputing 2022, 500, 135–142. [Google Scholar] [CrossRef]
- Carlini, N.; Wagner, D. Audio adversarial examples: Targeted attacks on speech-to-text. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 1–7. [Google Scholar]
- Xu, H.; Ma, Y.; Liu, H.C.; Deb, D.; Liu, H.; Tang, J.L.; Jain, A.K. Adversarial attacks and defenses in images, graphs and text: A review. Int. J. Autom. Comput. 2020, 17, 151–178. [Google Scholar] [CrossRef] [Green Version]
- Hu, S.; Shang, X.; Qin, Z.; Li, M.; Wang, Q.; Wang, C. Adversarial examples for automatic speech recognition: Attacks and countermeasures. IEEE Commun. Mag. 2019, 57, 120–126. [Google Scholar] [CrossRef]
- Cisse, M.; Adi, Y.; Neverova, N.; Keshet, J. Houdini: Fooling deep structured prediction models. arXiv 2017, arXiv:1707.05373. [Google Scholar]
- Mode, G.R.; Hoque, K.A. Crafting adversarial examples for deep learning based prognostics. In Proceedings of the 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020; pp. 467–472. [Google Scholar]
- Wang, D.; Wang, R.; Dong, L.; Yan, D.; Zhang, X.; Gong, Y. Adversarial examples attack and countermeasure for speech recognition system: A survey. In Proceedings of the International Conference on Security and Privacy in Digital Economy, Singapore, 30 October 2020; pp. 443–468. [Google Scholar]
- Carlini, N.; Mishra, P.; Vaidya, T.; Zhang, Y.; Sherr, M.; Shields, C.; Wagner, D.; Zhou, W. Hidden voice commands. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16), Austin, TX, USA, 10–12 August 2016; pp. 513–530. [Google Scholar]
- Das, R.K.; Tian, X.; Kinnunen, T.; Li, H. The attacker’s perspective on automatic speaker verification: An overview. Interspeech 2020. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 4213–4217. [Google Scholar]
- Abdullah, H.; Warren, K.; Bindschaedler, V.; Papernot, N.; Traynor, P. Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proceedings of the IEEE Symposium on Security and Privacy (SP 2021), San Francisco, CA, USA, 24–27 May 2021; pp. 730–747. [Google Scholar]
- Chen, X.; Li, S.; Huang, H. Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview. Appl. Sci. 2021, 11, 8450. [Google Scholar] [CrossRef]
- Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. 1978, 26, 43–49. [Google Scholar] [CrossRef] [Green Version]
- Reynolds, D.A.; Rose, R.C. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 1995, 3, 72–83. [Google Scholar] [CrossRef] [Green Version]
- Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 2000, 10, 19–41. [Google Scholar] [CrossRef] [Green Version]
- Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 788–798. [Google Scholar] [CrossRef]
- Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP 2014), Florence, Italy, 4–9 May 2014; pp. 4052–4056. [Google Scholar]
- Bai, Z.; Zhang, X.L. Speaker recognition based on deep learning: An overview. Neural Netw. 2021, 140, 65–99. [Google Scholar] [CrossRef] [PubMed]
- Muda, L.; Begam, M.; Elamvazuthi, I. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv 2010, arXiv:1003.4083. [Google Scholar]
- Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990, 87, 1738–1752. [Google Scholar] [CrossRef] [Green Version]
- Nandwana, M.K.; Ferrer, L.; McLaren, M.; Castan, D.; Lawson, A. Analysis of Critical Metadata Factors for the Calibration of Speaker Recognition Systems. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria, 15–19 September 2019; pp. 4325–4329. [Google Scholar]
- Dehak, N.; Dehak, R.; Glass, J.R.; Reynolds, D.A.; Kenny, P. Cosine similarity scoring without score normalization techniques. In Proceedings of the Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 28 June–1 July 2010; p. 15. [Google Scholar]
- Wang, D. A simulation study on optimal scores for speaker recognition. EURASIP J. Audio Speech Music. Process. 2020, 1, 1–23. [Google Scholar] [CrossRef]
- Hansen, J.H.; Hasan, T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag. 2015, 32, 74–99. [Google Scholar] [CrossRef]
- Jati, A.; Georgiou, P. Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1577–1589. [Google Scholar] [CrossRef]
- Chen, G.; Zhao, Z.; Song, F.; Chen, S.; Fan, L.; Liu, Y. SEC4SR: A security analysis platform for speaker recognition. arXiv 2021, arXiv:2109.01766. [Google Scholar]
- Dehak, N.; Dehak, R.; Kenny, P.; Brümmer, N.; Ouellet, P.; Dumouchel, P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Proceedings of the 10th Annual Conference of the International Speech Communication (INTERSPEECH 2009), Association, Brighton, UK, 6–10 September 2009; pp. 1559–1562. [Google Scholar]
- Zeinali, H.; Wang, S.; Silnova, A.; Matějka, P.; Plchot, O. But system description to voxceleb speaker recognition challenge 2019. arXiv 2019, arXiv:1910.12592. [Google Scholar]
- Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with sincnet. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT 2018), Athens, Greece, 18–21 December 2018; pp. 1021–1028. [Google Scholar]
- Chung, J.S.; Nagrani, A.; Zisserman, A. Voxceleb2: Deep speaker recognition. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 1086–1090. [Google Scholar]
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database. Natl. Inst. Stand. Technol. (NIST) 1988, 107, 16. [Google Scholar]
- Jankowski, C.; Kalyanswamy, A.; Basson, S.; Spitz, J. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1990), Albuquerque, NM, USA, 3–6 April 1990; pp. 109–112. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA 2017), Seoul, Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
- Du, J.; Na, X.; Liu, X.; Bu, H. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv 2018, arXiv:1808.10583. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An asr corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), South Brisbane, QSD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
- Campbell, J.P. Testing with the YOHO CD-ROM voice verification corpus. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1995), Detroit, MI, USA, 8–12 May 1995; pp. 341–344. [Google Scholar]
- Yamagishi, J.; Veaux, C.; MacDonald, K. CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92); The Centre for Speech Technology Research (CSTR), University of Edinburgh: Edinburgh, Scotland, 2019. [Google Scholar]
- Kreuk, F.; Adi, Y.; Cisse, M.; Keshet, J. Fooling end-to-end speaker verification with adversarial examples. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 15–20 April 2018; pp. 1962–1966. [Google Scholar]
- Abdullah, H.; Rahman, M.S.; Garcia, W.; Warren, K.; Yadav, A.S.; Shrimpton, T.; Traynor, P. Hear “No Evil”, See “Kenansville”*: Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (SP 2021), San Francisco, CA, USA, 24–27 May 2021; pp. 712–729. [Google Scholar]
- Li, X.; Zhong, J.; Wu, X.; Yu, J.; Li, X.; Meng, H. Adversarial attacks on GMM i-vector based speaker verification systems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-2020), Barcelona, Spain, 4–8 May 2020; pp. 6579–6583. [Google Scholar]
- Villalba, J.; Zhang, Y.; Dehak, N. x-vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 4233–4237. [Google Scholar]
- Jati, A.; Hsu, C.; Pal, M.; Peri, R.; AbdAlmageed, W.; Narayanan, S. Adversarial attack and defense strategies for deep speaker recognition systems. Comput. Speech Lang. 2021, 68, 101199. [Google Scholar] [CrossRef]
- Joshi, S.; Villalba, J.; Zelasko, P.; Moro-Velázquez, L.; Dehak, N. Study of Pre-Processing Defenses Against Adversarial Attacks on State-of-the-Art Speaker Recognition Systems. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4811–4826. [Google Scholar] [CrossRef]
- Zhang, W.; Zhao, S.; Liu, L.; Li, J.; Cheng, X.; Zheng, T.; Hu, X. Attack on Practical Speaker Verification System Using Universal Adversarial Perturbations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-2021), Toronto, ON, Canada, 6–11 June 2021; pp. 2575–2579. [Google Scholar]
- Liu, S.; Wu, H.; Lee, H.Y.; Meng, H. Adversarial attacks on spoofing counter measures of automatic speaker verification. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Singapore, 14–18 December 2019; pp. 312–319. [Google Scholar]
- Zhang, Y.; Jiang, Z.; Villalba, J.; Dehak, N. Black-Box Attacks on Spoofing Countermeasures Using Transferability of Adversarial Examples. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 4238–4242. [Google Scholar]
- Goto, K.; Inoue, N. Quasi-Newton Adversarial Attacks on Speaker Verification Systems. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA 2020), Auckland, New Zealand, 7–10 December 2020; pp. 527–531. [Google Scholar]
- Chen, G.; Chenb, S.; Fan, L.; Du, X.; Zhao, Z.; Song, F.; Liu, Y. Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (SP 2021), San Francisco, CA, USA, 24–27 May 2021; pp. 694–711. [Google Scholar]
- Li, J.; Zhang, X.; Xu, J.; Ma, S.; Gao, W. Learning to Fool the Speaker Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 2937–2941. [Google Scholar]
- Luo, H.; Shen, Y.; Lin, F.; Xu, G. Spoofing Speaker Verification System by Adversarial Examples Leveraging the Generalized Speaker Difference. Secur. Commun. Netw. 2021, 2021, 6664578. [Google Scholar] [CrossRef]
- Marras, M.; Korus, P.; Memon, N.D.; Fenu, G. Adversarial Optimization for Dictionary Attacks on Speaker Verification. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria, 15–19 September 2019; pp. 2913–2917. [Google Scholar]
- Zhang, L.; Meng, Y.; Yu, J.; Xiang, C.; Falk, B.; Zhu, H. Voiceprint mimicry attack towards speaker verification system in smart home. In Proceedings of the 39th IEEE Conference on Computer Communications (INFOCOM 2020), Toronto, ON, Canada, 6–9 July 2020; pp. 377–386. [Google Scholar]
- Abdullah, H.; Garcia, W.; Peeters, C.; Traynor, P.; Butler, K.R.; Wilson, J. Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS 2019), San Diego, CA, USA, 24–27 February 2019. [Google Scholar]
- Du, T.; Ji, S.; Li, J.; Gu, Q.; Wang, T.; Beyah, R. Sirenattack: Generating adversarial audio for end-to-end acoustic systems. In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security (ASIA CCS 2020), Taipei, Taiwan, 5–9 October 2020; pp. 357–369. [Google Scholar]
- Li, Z.; Wu, Y.; Liu, J.; Chen, Y.; Yuan, B. Advpulse: Universal, synchronization-free, and targeted audio adversarial attacks via subsecond perturbations. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS 2020), Virtual Event, 9–13 November 2020; pp. 1121–1134. [Google Scholar]
- Xie, Y.; Shi, C.; Li, Z.; Liu, J.; Chen, Y.; Yuan, B. Real-time, universal, and robust adversarial attacks against speaker recognition systems. In Proceedings of the 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 1738–1742. [Google Scholar]
- Chen, G.; Zhao, Z.; Song, F.; Chen, S.; Fan, L.; Liu, Y. AS2T: Arbitrary source-to-target adversarial attack on speaker recognition systems. arXiv 2022, arXiv:2206.03351. [Google Scholar] [CrossRef]
- Zheng, B.; Jiang, P.; Wang, Q.; Li, Q.; Shen, C.; Wang, C.; Ge, Y.; Teng, Q.; Zhang, S.; Zhang, S. Black-box adversarial attacks on commercial speech platforms with minimal information. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, 15–19 November 2021; pp. 86–107. [Google Scholar]
- Li, Z.; Shi, C.; Xie, Y.; Liu, J.; Yuan, B.; Chen, Y. Practical Adversarial Attacks Against Speaker Recognition Systems. In Proceedings of the HotMobile ’20: The 21st International Workshop on Mobile Computing Systems and Applications, Austin, TX, USA, 3–4 March 2020; pp. 9–14. [Google Scholar]
- Li, J.; Zhang, X.; Jia, C.; Xu, J.; Zhang, L.; Wang, Y.; Ma, S.; Gao, W. Universal Adversarial Perturbations Generative Network For Speaker Recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2020, London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
- Xie, Y.; Li, Z.; Shi, C.; Liu, J.; Chen, Y.; Yuan, B. Real-time, Robust and Adaptive Universal Adversarial Attacks Against Speaker Recognition Systems. J. Signal Process. Syst. 2021, 93, 1187–1200. [Google Scholar] [CrossRef]
- Tan, H.; Zhang, J.; Zhang, H.; Wang, L.; Qian, Y.; Gu, Z. NRI-FGSM: An Efficient Transferable Adversarial Attack Method for Speaker Recognition System. In Proceedings of the 23st Annual Conference of the International Speech Communication Association (Interspeech 2022), Incheon, Korea, 18–22 September 2022. [Google Scholar]
- Shamsabadi, A.S.; Teixeira, F.S.; Abad, A.; Raj, B.; Cavallaro, A.; Trancoso, I. FoolHD: Fooling Speaker Identification by Highly Imperceptible Adversarial Disturbances. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP 2021), Toronto, ON, Canada, 6–11 June 2021; pp. 6159–6163. [Google Scholar]
- Wang, Q.; Guo, P.; Xie, L. Inaudible adversarial perturbations for targeted attack in speaker recognition. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 4228–4232. [Google Scholar]
- Xie, Y.; Li, Z.; Shi, C.; Liu, J.; Chen, Y.; Yuan, B. Enabling fast and universal audio adversarial attack using generative model. In Proceedings of the AAAI Conference on Artificial Intelligence (EAAI 2021), Virtual Event, 2–9 February 2021; pp. 14129–14137. [Google Scholar]
- Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Kurakin, A.; Goodfellow, I.; Bengio, S. Adversarial examples in the physical world. In Proceedings of the Workshop of the 5th International Conference on Learning Representations (ICLR-2017), Toulon, France, 24–26 April 2017. [Google Scholar]
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, V. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR-2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 9185–9193. [Google Scholar]
- Ilyas, A.; Engstrom, L.; Athalye, A.; Lin, J. Black-box Adversarial Attacks with Limited Queries and Information. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; pp. 2142–2151. [Google Scholar]
- Chen, P.Y.; Zhang, H.; Sharma, Y.; Yi, J.; Hsieh, C.J. ZOO: Zeroth Order Optimization Based Black-box Attacks to Deep Neural Networks without Training Substitute Models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec@CCS 2017), Dallas, TX, USA, 3 November 2017; pp. 15–26. [Google Scholar]
- Stan, G.B.; Embrechts, J.J.; Archambeau, D. Comparison of different impulse response measurement techniques. J. Audio Eng. Soc. 2002, 50, 249–262. [Google Scholar]
- Machado, G.R.; Silva, E.; Goldschmidt, R.R. Adversarial Machine Learning in Image Classification: A Survey Toward the Defender’s Perspective. ACM Comput. Surv. (CSUR) 2021, 55, 1–38. [Google Scholar] [CrossRef]
- Zhang, J.; Tan, H.; Deng, B.; Hu, J.; Zhu, D.; Huang, L.; Gu, Z. NMI-FGSM-Tri: An Efficient and Targeted Method for Generating Adversarial Examples for Speaker Recognition. In Proceedings of the Sixth IEEE International Conference on Data Science in Cyberspace (DSC 2022), Guilin, China, 11–13 July 2022. [Google Scholar]
- Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Speech Audio Process. 2008, 16, 229–238. [Google Scholar] [CrossRef]
- Amazon Mechanical Turk Platform. Available online: https://www.mturk.com (accessed on 2 November 2005).
- Delgado, H.; Evans, N.; Kinnunen, T.; Lee, K.A.; Liu, X.; Nautsch, A.; Patino, J.; Sahidullah, M.; Todisco, M.; Wang, X.; et al. ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan. arXiv 2021, arXiv:2109.00535. [Google Scholar]
- Akhtar, N.; Mian, A. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access 2018, 6, 14410–14430. [Google Scholar] [CrossRef]
- Wang, Q.; Guo, P.; Sun, S.; Xie, L.; Hansen, J.H. Adversarial Regularization for End-to-End Robust Speaker Verification. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria, 15–19 September 2019; pp. 4010–4014. [Google Scholar]
- Wu, H.; Liu, S.; Meng, H.; Lee, H.Y. Defense against adversarial attacks on spoofing countermeasures of ASV. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 6564–6568. [Google Scholar]
- Pal, M.; Jati, A.; Peri, R.; Hsu, C.C.; AbdAlmageed, W.; Narayanan, S. Adversarial defense for deep speaker recognition using hybrid adversarial training. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto, ON, Canada, 6–11 June 2021; pp. 6164–6168. [Google Scholar]
- Li, X.; Li, N.; Zhong, J.; Wu, X.; Liu, X.; Su, D.; Yu, D.; Meng, H. Investigating robustness of adversarial samples detection for automatic speaker verification. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 1540–1544. [Google Scholar]
- Villalba, J.; Joshi, S.; Żelasko, P.; Dehak, N. Representation Learning to Classify and Detect Adversarial Attacks against Speaker and Speech Recognition Systems. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czechia, 30 August–3 September 2021; pp. 4304–4308. [Google Scholar]
- Peng, Z.; Li, X.; Lee, T. Pairing weak with strong: Twin models for defending against adversarial attack on speaker verification. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czechia, 30 August–3 September 2021; pp. 4284–4288. [Google Scholar]
- Wu, H.; Hsu, P.C.; Gao, J.; Zhang, S.; Huang, S.; Kang, J.; Wu, Z.; Meng, H.; Lee, H.Y. Spotting adversarial samples for speaker verification by neural vocoders. arXiv 2021, arXiv:2107.00309. [Google Scholar]
- Joshi, S.; Kataria, S.; Villalba, J.; Dehak, N. AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification. arXiv 2022, arXiv:2204.03848. [Google Scholar]
- Zhang, H.; Wang, L.; Zhang, Y.; Liu, M.; Lee, K.A.; Wei, J. Adversarial Separation Network for Speaker Recognition. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 951–955. [Google Scholar]
- Wu, H.; Liu, A.T.; Lee, H.Y. Defense for black-box attacks on anti-spoofing models by self-supervised learning. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 25–29 October 2020; pp. 3780–3784. [Google Scholar]
- Wu, H.; Li, X.; Liu, A.T.; Wu, Z.; Meng, H.; Lee, H.Y. Improving the adversarial robustness for speaker verification by self-supervised learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 202–217. [Google Scholar] [CrossRef]
- Wu, H.; Zhang, Y.; Wu, Z.; Wang, D.; Lee, H.Y. Voting for the right answer: Adversarial defense for speaker verification. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czechia, 30 August–3 September 2021; pp. 4294–4298. [Google Scholar]
- Wu, H.; Li, X.; Liu, A.T.; Wu, Z.; Meng, H.; Lee, H.Y. Adversarial defense for automatic speaker verification by cascaded self-supervised learning models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto, ON, Canada, 6–11 June 2021; pp. 6718–6722. [Google Scholar]
- Olivier, R.; Raj, B.; Shah, M. High-Frequency Adversarial Defense for Speech and Audio. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto, ON, Canada, 6–11 June 2021; pp. 2995–2999. [Google Scholar]
- Chang, L.C.; Chen, Z.; Chen, C.; Wang, G.; Bi, Z. Defending Against Adversarial Attacks in Speaker Verification Systems. In Proceedings of the IEEE International Performance, Computing, and Communications Conference (IPCCC 2021), Austin, TX, USA, 29–31 October 2021; pp. 1–8. [Google Scholar]
- Miyato, T.; Dai, A.M.; Goodfellow, I. Adversarial training methods for semi-supervised text classification. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
- Miyato, T.; Maeda, S.I.; Koyama, M.; Nakae, K.; Ishii, S. Distributional smoothing with virtual adversarial training. arXiv 2015, arXiv:1507.00677. [Google Scholar]
- Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar]
- Zhang, H.; Wang, J. Defense against adversarial attacks using feature scattering-based adversarial training. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 1829–1839. [Google Scholar]
- Yamamoto, R.; Song, E.; Kim, J.M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 6199–6203. [Google Scholar]
- Zhu, X.; Beauregard, G.T.; Wyse, L.L. Real-time signal estimation from modified short-time Fourier transform magnitude spectra. IEEE Trans. Speech Audio Process. 2007, 15, 1645–1653. [Google Scholar] [CrossRef]
- Hubert, M.; Debruyne, M.; Rousseeuw, P.J. Minimum covariance determinant and extensions. Wiley Interdiscip. Rev. Comput. Stat. 2018, 10, e1421. [Google Scholar] [CrossRef] [Green Version]
- Chen, Z. On the Detection of Adaptive Adversarial Attacks in Speaker Verification Systems. arXiv 2022, arXiv:2202.05725. [Google Scholar]
- Yang, Z.; Li, B.; Chen, P.Y.; Song, D. Characterizing audio adversarial examples using temporal dependency. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Liu, A.T.; Yang, S.W.; Chi, P.H.; Hsu, P.C.; Lee, H.Y. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 6419–6423. [Google Scholar]
- AdvBox. Available online: https://github.com/Ewenwan/AdvBox (accessed on 4 September 2018).
- TorchAttack. Available online: https://adversarial-attacks-pytorch.readthedocs.io/en/latest (accessed on 15 July 2020).
- FoolBox. Available online: https://github.com/bethgelab/foolbox (accessed on 22 September 2020).
- TextAttack. Available online: https://github.com/qdata/textattack (accessed on 20 November 2020).
- OpenAttack. Available online: https://github.com/thunlp/openattack (accessed on 6 August 2021).
Strategy | Model | Dataset | Task | Metrics | Performance |
---|---|---|---|---|---|
Statistics | GMM-UBM | NIST SRE | OSI/SV | EER | 1.81% |
i-vector | NIST 2008 | OSI/SV | EER | 6.3% | |
Embedding | AudioNet | LibriSpeech | CSI | ACC | 99.7% |
VGGvox | Voxceleb1 | CSI | ACC | 92.1% | |
d-Vector | Google data | OSI/SV | EER | 4.54% | |
x-vector | VoxCeleb | OSI/SV | EER | 4.16% | |
r-Vector | VoxCeleb | OSI/SV | EER | 1.49% | |
SincNet | LibriSpeech | OSI/SV | EER | 0.96% | |
ECAPA-TDNN | VoxCeleb2 | OSI/SV | EER | 0.87% | |
End to End | ResCNN | MTurk | OSI/SV | EER | 2.83% |
GRU | MTurk | OSI/SV | EER | 2.78% |
Datasets | Sample Rate | Data Size | Spk Num | Language | Text Dependency | Condition |
---|---|---|---|---|---|---|
TIMIT | 16 kHz | 6300 sentences | 630 | English | TI | Clean |
NTIMIT | 8 kHz | 6300 sentences | 630 | English | TI | Telephone line |
Aishell | 16 kHz | 178 h | 400 | Chinese | TI | No noise |
LibriSpeech | 16 kHz | 153,516 utterances | >9000 | English | TI | / |
VoxCeleb1 | 16 kHz | 1,128,246 utterances | 1251 | English | TI | Multi-media |
VoxCeleb2 | - | 100 w sentences | 6112 | Multilingual | TI | Multi-media |
YOHO | 8 kHz | 5500 phrases | 138 | English | TD | Office |
CSTR VCTK | 48 kHz | 1000 sentences | 30 | English | TD | Wild |
Methods | Target | Capability | Knowledge | Generate Strategy | Perturbation Object | Metrics | OTA | Victim Model | Corpus |
---|---|---|---|---|---|---|---|---|---|
SEC4SR [58] | Both | Individual | White Black | Gradient Sign Optimization Evolutionary | Mel-Spec | ASR/SNR PESQ | Digital Physical | AudioNet GMM i-vecor x-vector | LibriSpeech |
Kreuk [71] | Untarget | Individual | White Black | Gradient Sign | Mel-Spec MFCC | ACC | Digital | End-to-end | YOHO NTIMIT |
Abdullah [72] | Target | Individual | Black | Audio process | MFCC | ASR | Digital | Microsoft Azure | - |
Li [73] | Target | Individual | White Black | Gradient Sign | LPMS MFCC | EER | Digital | i-Vector x-vector | VoxCeleb1 |
Villalba [74] | Both | Individual | White Black | Optimization | MFCC | EER/SNR minDCF PESQ | Digital | ResNet34 ThinResNet34 TDNN | Voxceleb1&2 |
Jati [75] | Both | Individual | White | Gradient Sign | Mel-Spec STFT | ASR | Digital | 1D-CNN TDNN | LibriSpeech |
Joshi [76] | Both | Universal | White | Gradient Sign Optimization | Fbank | ACC | Digital | ResNet34 Transformer x-vector | LibriSpeech VoxCeleb |
Two-step [77] | Target | Universal | White | Optimization | - | ASR/WER CER/SNR | Digital Physical | VGG Thin-ResNet-34 Fast-ResNet | LibriSpeech Voxceleb2 |
Liu [78] | Target | Individual | White | Gradient Sign | Mel-Spec | EER min-tDCF | Digital | LCNN-Big LCNN-Small SeNet | ASVspoof 2019 |
MI-FGSM [79] | Target | Individual | Black | Gradient Sign | log-power magnitude spectrum | ASR | Digital | LCNN/AFNet SENet50 ResNet34 | ASVspoof 2019 |
Quasi [80] | Both | Individual | White | Optimization | MFCC | EER | Digital | GMM i-vector | Voxceleb1 |
FakeBob [81] | Both | Individual | Black | Evolutionary | PLP MFCC | ACC SNR | Digital Physical | GMM-UBM i-vector x-vector | LibriSpeech |
Li [82] | Both | Individual | White | Optimization | Waveform | ASR/SNR PESQ | Digital | SincNet | TIMIT |
GE2E [83] | Target | Individual | White | Optimization | Feats | SR/SNR MNR | Digital | d-vector | TIMIT |
Dictionary [84] | Target | Individual | White | Dictionary | Mel-Spec | SR | Digital | VGGvox | VoxCeleb2 |
VMask [85] | Target | Individual | Grey Black | Gradient Sign Optimization | Mel-Spec | WER/SER SNR | Digital Physical | VGGVox | LibriSpeech |
Abdullah [86] | Untarget | Individual | Black | Feature Process | MFCC | ASR | Digital | End-to-end | TIMIT LibriSpeech |
Siren [87] | Target | Individual | White Black | Evolutionary | MFCC | ASR SNR | Digital | End-to-end | VCTK IEMOCAP |
AdvPulse [88] | Target | Individual | White | Optimization | MFCC | ASR | Physical | x-vector | VCTK |
Xie [89] | Both | Universal | White | Gradient Sign | MFCC | ASR | Simulated | x-vector | VCTK |
AS2T [90] | Both | Individual | White Black | Gradient Sign Optimization | Waveform | ASR/SNR PESQ/ | Digital Physical | Open source SRSs | LibriSpeech |
Occam [91] | Target | Individual | Black | Optimization | - | ASR/SNR | API | Commercial SRSs | LibriSpeech |
Li [92] | Both | Individual | White | Gradient Sign Optimization | Waveform | ACC ASR | Digital Physical | x-vector | VCTK |
UAPG [93] | Target | Universal | White | Optimization | MFCC | FR/SR | Digital | x-vector | VCTK |
Xie [94] | Both | Universal | White | Gradient Sign | MFCC | ASR | Simulated | x-vector | VCTK |
NRI-FGSM [95] | Target | Individual | Black | Gradient Sign | Waveform | ASR/SNR PESQ/ | Digital | x-vector ECAPA | LibriSpeech |
FoolHD [96] | Target | Individual | White | Optimization | MFCC | ASR/JND PESQ | Digital | - | Voxceleb |
Inaudible [97] | Target | Individual | White | Gradient Sign | Waveform | ASR | Digital | x-vector | Aishell-1 |
UAPs [98] | Both | Universal | White | Gradient Sign | Waveform | SER/PTR SNR PESQ | Digital | End-to-end | TIMIT LibriSpeech |
Ref. | Task | UT/T | OTA | Distance (m) | Methods | ASR (%) | Generation Model | Victim model | Commercial System | Defense | Time |
---|---|---|---|---|---|---|---|---|---|---|---|
[77] | SV | T | Real | 1.7 | RIR | 67.7 | Res34-V | Res34-V | - | RD | - |
[81] | OSI/SV | T | Real | 0.25–8 | NES+RIR | 100 | GMM | - | laptop (Dell) Shinco OPPO, JBL iphone 6 plus | LS, QT AS, TDD | - |
[86] | CSI/SV | T | Real | 0.3 | TDI, RPG, HFA, TS | 100 | - | - | Azure | VAD | - |
[88] | SI | T | Real | 1.6∼3 | RIR+BPF | 96.9 | ? | - | TKGOU Honda | - | - |
[89] | CSI | T | Real | - | RIR | 90.19 | x-vector (Tensorflow) | x-vector (Kaldi) | - | - | 0.015 s |
[90] | CSI | UT | Sim. Real | - | RIR+RN | 97.4 | ? | ? | - | - | - |
[91] | SI/SV | T | Real | 0.15 | GI | 71.7 | ASplRE | - | Google Assistant Siri, iFlytek Cortana Amazon Echo | - | - |
[92] | SV | UT+T | Real | - | GA+RIR | 50 | x-vector | - | Talentedsoft Azure | - | - |
Categories | Methods | Metrics (%) | Evaluation | ||||||
---|---|---|---|---|---|---|---|---|---|
Model | Dataset | Performance | |||||||
Baseline | Attack | Defense | |||||||
Adversarial Training | Wang [112] | EER | GE2E-ASV | TIMIT | 4.87 | FGSM | 11.89 | FGSM | 8.31 |
LDS | 9.26 | ||||||||
Wu [113] | ACC | VGG | ASVspoof 2019 | 99.99 | PGD | 37.06 | PGD | 92.40 | |
SENet | 99.97 | 48.32 | 98.60 | ||||||
Pal [114] | ACC | 1D-CNN | LibriSpeech | 99.55 | FGSM | 6.03 | HTA10 | 90.60 | |
PGD-10 | 0.00 | 81.12 | |||||||
CW-10 | 0.00 | 80.12 | |||||||
Attack Detection | Li [115] | DA | VGG-like | VoxCeleb1 | - | BIM-xvec | 90.65 | ||
EER | 0.46 | ||||||||
Villalba [116] | ACC | Espresso | VoxCeleb 1&2 | - | CW-L2 | 82.9 | |||
Peng [117] | FAR | Twin Models | VoxCeleb1 | - | 4.48 | ||||
Wu [118] | ACC | Representation | VoxCeleb 1&2 | - | Vocoder (0.01 FPR) | 98.92 | |||
Joshi [119] | EER | AdvEst | Voxceleb2 | - | FGSM/BIM/CW | 14.57 | |||
Input Refactoring | Joshi [76] | ACC | ResNet34 | LibriSpeech | 100 | BIM | 0 | PWG BPDA | 97.2 |
CW | 1.3 | 98.8 | |||||||
Wu [113] | ACC | VGG | ASVspoof 2019 | 99.99 | PGD | 37.06 | AT+Mean filter | 93.76 | |
SENet | 99.97 | 48.32 | 99.24 | ||||||
Zhang [120] | EER | SE-Resnext | VCTK | 1.43 | FGSM | 13.81 | 3.62 | ||
PGD | 16.66 | 1.94 | |||||||
Wu [121] | ACC | LCNN | ASVspoof 2019 | 80–90 | PGD | (5–10) | (80–90) | ||
SENet | 80–90 | (5–10) | (80–90) | ||||||
Wu [122] | AdvFAR | x-vector | VoxCeleb1 | 5.97 | BIM | 87.36 | 16.54 | ||
AdvFRR | 51.95 | 16.88 | |||||||
j-FAR | r-vector | 8.40 | 48.04 | 17.84 | |||||
j-FRR | 30.41 | 18.51 | |||||||
Wu [123] | FRR | Fast ResNet-34 | VoxCeleb 1&2 | ASP | 2.24 | BIM-10 | 89.38 | 3.6 | |
FAR | 2.56 | 91.94 | 16.67 | ||||||
Wu [124] | EER | r-vector | VoxCeleb1 | 8.87 | BIM | 66.02 | 22.94 | ||
Oliver [125] | ACC | 1D-CNN | LibriSpeech | 88 | PGD | 7 | MAD | 74 | |
CW | 9 | 69 | |||||||
Chang [126] | ASR | i-vector | LibriSpeech | - | BIM | 100 | 1 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tan, H.; Wang, L.; Zhang, H.; Zhang, J.; Shafiq, M.; Gu, Z. Adversarial Attack and Defense Strategies of Speaker Recognition Systems: A Survey. Electronics 2022, 11, 2183. https://doi.org/10.3390/electronics11142183
Tan H, Wang L, Zhang H, Zhang J, Shafiq M, Gu Z. Adversarial Attack and Defense Strategies of Speaker Recognition Systems: A Survey. Electronics. 2022; 11(14):2183. https://doi.org/10.3390/electronics11142183
Chicago/Turabian StyleTan, Hao, Le Wang, Huan Zhang, Junjian Zhang, Muhammad Shafiq, and Zhaoquan Gu. 2022. "Adversarial Attack and Defense Strategies of Speaker Recognition Systems: A Survey" Electronics 11, no. 14: 2183. https://doi.org/10.3390/electronics11142183