Res2Former: Integrating Res2Net and Transformer for a Highly Efficient Speaker Verification System
Abstract
1. Introduction
- Traditional CNNs are inefficient in extracting deep speech features because they struggle to capture long-range dependencies and global features, which limits the discriminative power of the feature representations they produce.
- Transformers, although excellent at modeling global dependencies, have a large number of parameters and are complex to train. Their high computational complexity makes them difficult to deploy in lightweight or resource-constrained environments.
- The complexity of both CNNs and Transformers results in slower inference. The significant computational overhead at inference time limits their applicability in real-time scenarios that require rapid responses.
- We propose a lightweight simple transformer (LST) architecture that retains the advantages of Transformers while significantly reducing both model parameters and computational complexity by simplifying the self-attention mechanism and optimizing the network design.
- We introduce Res2Former, an effective SV model that builds on LST by incorporating the Res2Net architecture. By leveraging the multi-scale feature extraction capability of Res2Net, Res2Former enhances its ability to capture fine-grained characteristics of speech signals, thereby improving the discriminative performance of speaker embeddings.
- We design feature processing strategies and a time-frequency adaptive feature fusion (TAFF) mechanism at different network depths. By introducing targeted feature processing methods at various layers and integrating attention from both the time and frequency domains, we enhance the richness and discriminative power of the feature representations; an illustrative sketch of this style of fusion appears after this list.
- By employing a combination of pre-training and large-margin fine-tuning strategies, we optimize pre-trained models and further improve model performance.
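The time-frequency fusion idea mentioned above can be pictured with a short sketch. The PyTorch-style code below is only an illustration of attention-weighted fusion over the time and frequency axes under our own assumptions: the class name `TAFFSketch`, the pooling choices, and the assumed (batch, channels, frequency, time) feature layout are placeholders and do not reproduce the authors' TAFF module.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumes features shaped (batch, channels, freq, time); all names are placeholders.
import torch
import torch.nn as nn


class TAFFSketch(nn.Module):
    """Toy time-frequency adaptive fusion of two feature maps."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions turn pooled statistics into attention logits
        self.time_att = nn.Conv1d(channels, channels, kernel_size=1)
        self.freq_att = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x, y: (B, C, F, T) feature maps from two branches/depths
        # Time attention: average over the frequency axis -> (B, C, T)
        t_logits = self.time_att(x.mean(dim=2) + y.mean(dim=2))
        # Frequency attention: average over the time axis -> (B, C, F)
        f_logits = self.freq_att(x.mean(dim=3) + y.mean(dim=3))
        # Broadcast the two attention maps back to (B, C, F, T)
        att = torch.sigmoid(t_logits.unsqueeze(2) + f_logits.unsqueeze(3))
        # Adaptive fusion: the gate decides how much of each branch to keep
        return att * x + (1.0 - att) * y


if __name__ == "__main__":
    fuse = TAFFSketch(channels=64)
    a = torch.randn(2, 64, 80, 200)   # e.g. 80 mel bins, 200 frames
    b = torch.randn(2, 64, 80, 200)
    print(fuse(a, b).shape)           # torch.Size([2, 64, 80, 200])
```

The point being illustrated is only that a single gate, computed jointly from the time and frequency axes, decides per position how much of each branch survives the fusion; the paper's actual TAFF formulation (Section 2.2.2) may weight the branches differently.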
2. Method
2.1. Lightweight Simple Transformer (LST)
2.1.1. Simplified Multi-Scale Convolutional Attention
2.1.2. A Feedforward Network Based on Global Response Normalization
2.2. Res2Former Fusing LST Based on Res2Net Structure
2.2.1. Overview of Res2Former
- The convolutional normalization layer (P-C K = 1 & LayerNorm) processes the input features with a convolutional layer of kernel size 1 followed by layer normalization, i.e., X′ = LayerNorm(Conv_{k=1}(X)).
- The core module of each stage is the Res2Former Block, which combines the multi-scale convolutional capability of Res2Net with the feature-processing efficiency of the Transformer; a rough structural sketch appears after these bullets.
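Because the paper's formulas are not reproduced here, the following PyTorch-style sketch only mirrors the structural pattern the two bullets describe: a kernel-size-1 convolution followed by layer normalization, and a Res2Net-style channel split in which each scale is processed by a small mixer standing in for the LST. All names (`ConvNorm`, `Res2FormerBlockSketch`) and internals are our own placeholders, not the authors' implementation.

```python
# Rough structural sketch only -- placeholder names, not the authors' code.
import torch
import torch.nn as nn


class ConvNorm(nn.Module):
    """Pointwise (kernel size 1) convolution followed by layer normalization."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, T) -> (B, C_out, T); LayerNorm acts on the channel dim
        x = self.conv(x)
        return self.norm(x.transpose(1, 2)).transpose(1, 2)


class Res2FormerBlockSketch(nn.Module):
    """Res2Net-style channel split with a per-scale mixer standing in for the LST."""

    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        # One small mixer per scale (from the second scale on, as in Res2Net)
        self.mixers = nn.ModuleList(
            nn.Sequential(nn.Conv1d(width, width, kernel_size=3, padding=1), nn.GELU())
            for _ in range(scales - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T); split channels into `scales` groups
        splits = torch.chunk(x, self.scales, dim=1)
        outs = [splits[0]]  # first split passes through unchanged
        prev = None
        for i, mixer in enumerate(self.mixers, start=1):
            # Hierarchical residual connection between scales (Res2Net pattern)
            inp = splits[i] if prev is None else splits[i] + prev
            prev = mixer(inp)
            outs.append(prev)
        return torch.cat(outs, dim=1) + x  # residual over the whole block


if __name__ == "__main__":
    stem = ConvNorm(80, 128)                   # 80-dim frame features -> 128 channels
    block = Res2FormerBlockSketch(128, scales=4)
    out = block(stem(torch.randn(2, 80, 200)))  # (batch, channels, frames)
    print(out.shape)                            # torch.Size([2, 128, 200])
```

The hierarchical residual connections between scales follow the published Res2Net pattern; the per-scale convolutional mixer is merely a stand-in for the LST described in Section 2.1.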
2.2.2. The Time-Frequency Adaptive Feature Fusion Mechanism
2.2.3. Frame-Level Time-Frequency Adaptive Feature Fusion
2.2.4. Overall Time-Frequency Adaptive Feature Fusion
2.3. Feature Processing Strategies and ASP at Different Depths
2.4. Loss Function and Large Margin Fine-Tuning
3. Experimental Setup
3.1. Dataset
3.2. Evaluation Metrics
3.3. Model Configurations
3.4. Training Options
4. Experimental Results
4.1. Experimental Results on VoxCeleb1
4.2. Experimental Results on CN-Celeb(E)
4.3. Memory Consumption and Ablation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40. [Google Scholar] [CrossRef]
- Naika, R. An overview of automatic speaker verification system. In Intelligent Computing and Information and Communication: Proceedings of the 2nd International Conference, ICICC 2017; Springer: Berlin/Heidelberg, Germany, 2018; pp. 603–610. [Google Scholar]
- Chen, G.; Chen, S.; Fan, L.; Du, X.; Zhao, Z.; Song, F.; Liu, Y. Who is Real Bob? Adversarial attacks on speaker recognition systems. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 24–27 May 2021; pp. 694–711. [Google Scholar]
- Hayashi, V.T.; Ruggiero, W.V. Hands-free authentication for virtual assistants with trusted IoT device and machine learning. Sensors 2022, 22, 1325. [Google Scholar] [CrossRef] [PubMed]
- Sigona, F.; Grimaldi, M. Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions. Speech Commun. 2024, 158, 103045. [Google Scholar] [CrossRef]
- Waghmare, K.; Gawali, B. Speaker Recognition for forensic application: A Review. J. Posit. Sch. Psychol. 2022, 6, 984–992. [Google Scholar]
- Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4052–4056. [Google Scholar]
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv 2020, arXiv:2005.07143. [Google Scholar]
- Heo, H.J.; Shin, U.H.; Lee, R.; Cheon, Y.; Park, H.M. NeXt-TDNN: Modernizing multi-scale temporal convolution backbone for speaker verification. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11186–11190. [Google Scholar]
- Vaswani, A. Attention is all you need. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Khan, M.; Ahmad, J.; Gueaieb, W.; De Masi, G.; Karray, F.; El Saddik, A. Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices. IEEE Trans. Consum. Electron. 2025, 71, 1092–1101. [Google Scholar] [CrossRef]
- Chen, H.; Zendehdel, N.; Leu, M.C.; Moniruzzaman, M.; Yin, Z.; Hajmohammadi, S. Repetitive Action Counting Through Joint Angle Analysis and Video Transformer Techniques. In Proceedings of the International Symposium on Flexible Automation, American Society of Mechanical Engineers, Seattle, WA, USA, 21–24 July 2024; Volume 87882, p. V001T08A003. [Google Scholar]
- Safari, P.; India, M.; Hernando, J. Self-attention encoding and pooling for speaker recognition. arXiv 2020, arXiv:2008.01077. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Zhang, N.; Wang, J.; Hong, Z.; Zhao, C.; Qu, X.; Xiao, J. DT-SV: A transformer-based time-domain approach for speaker verification. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–7. [Google Scholar]
- Zhang, Y.; Lv, Z.; Wu, H.; Zhang, S.; Hu, P.; Wu, Z.; Lee, H.-y.; Meng, H. MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification. arXiv 2022, arXiv:2203.15249. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Peng, J.; Stafylakis, T.; Gu, R.; Plchot, O.; Mošner, L.; Burget, L.; Černockỳ, J. Parameter-efficient transfer learning of pre-trained transformer models for speaker verification using adapters. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Peng, J.; Plchot, O.; Stafylakis, T.; Mošner, L.; Burget, L.; Černockỳ, J. An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 555–562. [Google Scholar]
- Okabe, K.; Koshinaka, T.; Shinoda, K. Attentive statistics pooling for deep speaker embedding. arXiv 2018, arXiv:1803.10963. [Google Scholar]
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A large-scale speaker identification dataset. arXiv 2017, arXiv:1706.08612. [Google Scholar]
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep speaker recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
- Fan, Y.; Kang, J.; Li, L.; Li, K.; Chen, H.; Cheng, S.; Zhang, P.; Zhou, Z.; Cai, Y.; Wang, D. CN-Celeb: A challenging Chinese speaker recognition dataset. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; pp. 7604–7608. [Google Scholar]
- Li, L.; Liu, R.; Kang, J.; Fan, Y.; Cui, H.; Cai, Y.; Vipperla, R.; Zheng, T.F.; Wang, D. CN-Celeb: Multi-genre speaker recognition. Speech Commun. 2022, 137, 77–91. [Google Scholar] [CrossRef]
- Cai, W.; Chen, J.; Zhang, J.; Li, M. On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2020, 28, 1038–1051. [Google Scholar] [CrossRef]
- Snyder, D.; Chen, G.; Povey, D. MUSAN: A music, speech, and noise corpus. arXiv 2015, arXiv:1510.08484. [Google Scholar]
- Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
Comparison of Transformer and LST backbones with different numbers of blocks on VoxCeleb1:

| Backbone | Blocks | Params (M) | VoxCeleb1-O EER% | VoxCeleb1-O minDCF | VoxCeleb1-E EER% | VoxCeleb1-E minDCF | VoxCeleb1-H EER% | VoxCeleb1-H minDCF |
|---|---|---|---|---|---|---|---|---|
| ECAPA-TDNN | 3 | 20.8 | 1.34 | 0.16 | 1.60 | 0.18 | 3.17 | 0.30 |
| Transformer | 6 | 11.8 | 1.64 | 0.15 | 1.86 | 0.20 | 3.15 | 0.29 |
| Transformer | 9 | 16.5 | 1.62 | 0.14 | 1.78 | 0.18 | 3.11 | 0.28 |
| Transformer | 12 | 21.1 | 1.62 | 0.17 | 1.81 | 0.19 | 3.13 | 0.29 |
| LST | 6 | 10.7 | 1.16 | 0.12 | 1.38 | 0.15 | 2.52 | 0.24 |
| LST | 9 | 14.6 | 1.02 | 0.11 | 1.32 | 0.15 | 2.44 | 0.23 |
| LST | 12 | 18.5 | 1.12 | 0.12 | 1.30 | 0.14 | 2.40 | 0.23 |
Comparison of Res2Former configurations with other systems on VoxCeleb1:

| Size | Model | Params (M) | VoxCeleb1-O EER% | VoxCeleb1-O minDCF | VoxCeleb1-E EER% | VoxCeleb1-E minDCF | VoxCeleb1-H EER% | VoxCeleb1-H minDCF |
|---|---|---|---|---|---|---|---|---|
| – | MFA-Conformer | 28.35 | 1.03 | 0.12 | 1.36 | 0.15 | 2.59 | 0.24 |
| – | ECAPA-TDNN | 20.8 | 1.34 | 0.16 | 1.60 | 0.18 | 3.17 | 0.30 |
| – | NeXt-TDNN | 7.14 | 0.92 | 0.10 | 1.03 | 0.11 | 1.91 | 0.18 |
| – | Res2Net | 4.03 | 1.86 | 0.19 | 1.69 | 0.18 | 2.92 | 0.26 |
| Base | Res2Former (B = 6, C = 80) | 1.73 | 1.23 | 0.12 | 1.36 | 0.14 | 2.38 | 0.23 |
| Base | Res2Former (B = 3, C = 128) | 2.39 | 1.03 | 0.09 | 1.22 | 0.13 | 2.11 | 0.21 |
| Base | Res2Former (B = 2, C = 192) | 3.81 | 0.99 | 0.09 | 1.13 | 0.12 | 1.99 | 0.19 |
| Large | Res2Former (B = 2, C = 256) | 6.62 | 0.81 | 0.08 | 0.98 | 0.11 | 1.81 | 0.17 |
| Large | Res2Former (B = 2, C = 288) | 8.31 | 0.91 | 0.09 | 1.07 | 0.12 | 2.00 | 0.20 |
| Large | Res2Former (B = 1, C = 384) | 9.06 | 1.01 | 0.08 | 1.11 | 0.12 | 1.97 | 0.19 |
Comparison on CN-Celeb(E), together with real-time factor (RTF):

| Size | Model | RTF () | CN-Celeb(E) EER% | CN-Celeb(E) minDCF |
|---|---|---|---|---|
| – | MFA-Conformer | 33.06 | 12.13 | 0.62 |
| – | ECAPA-TDNN | 42.14 | 11.66 | 0.60 |
| – | NeXt-TDNN | 5.30 | 10.48 | 0.54 |
| – | Res2Net | 6.85 | 11.12 | 0.58 |
| Base | Res2Former (B = 6, C = 80) | 6.45 | 9.89 | 0.57 |
| Base | Res2Former (B = 3, C = 128) | 4.25 | 9.19 | 0.56 |
| Base | Res2Former (B = 2, C = 192) | 4.14 | 8.43 | 0.47 |
| Large | Res2Former (B = 2, C = 256) | 5.16 | 8.39 | 0.46 |
| Large | Res2Former (B = 2, C = 288) | 6.14 | 8.99 | 0.48 |
| Large | Res2Former (B = 1, C = 384) | 7.11 | 9.16 | 0.49 |
Ablation results for Res2Former (B = 2, C = 256) on VoxCeleb1:

| Configuration | VoxCeleb1-O EER% | VoxCeleb1-O minDCF | VoxCeleb1-E EER% | VoxCeleb1-E minDCF | VoxCeleb1-H EER% | VoxCeleb1-H minDCF |
|---|---|---|---|---|---|---|
| Res2Former (B = 2, C = 256) | 0.81 | 0.08 | 0.98 | 0.11 | 1.81 | 0.17 |
| Without GRN | 0.85 | 0.09 | 1.04 | 0.11 | 1.85 | 0.17 |
| Without Concat | 1.22 | 0.11 | 1.25 | 0.13 | 2.21 | 0.21 |
| Without TAFF | 1.10 | 0.09 | 1.20 | 0.12 | 2.07 | 0.19 |
| Weight Avg | 1.26 | 0.11 | 1.21 | 0.13 | 2.13 | 0.19 |
| Only Concat | 0.90 | 0.09 | 1.05 | 0.11 | 1.86 | 0.17 |