Symmetry and Asymmetry Principles in Deep Speaker Verification Systems: Balancing Robustness and Discrimination Through Hybrid Neural Architectures
Abstract
1. Introduction
2. Related Work
3. Proposed Model
3.1. Audio Stream Encoder
3.2. Video Stream Encoder
3.3. Gated Fusion Module
3.4. Shared Temporal Encoder
3.5. Multi-Head Attention Pooling and Embedding Learning
3.6. Theoretical Interpretation of the Symmetry/Asymmetry Principle
4. Experimental Setup
4.1. Datasets
4.2. Feature Extraction
4.3. Training Protocol
4.4. Evaluation Metrics
4.5. Computational Cost
5. Results
5.1. Ablation Study
5.2. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cohen, T.S.; Welling, M. Group Equivariant Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 2990–2999.
- Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean Data. IEEE Signal Process. Mag. 2017, 34, 18–42.
- Lin, Q.; Yang, L.; Wang, X.; Qin, X.; Wang, J.; Li, M. Towards Lightweight Applications: Asymmetric Enroll-Verify Structure for Speaker Verification. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7067–7071.
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333.
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 3830–3834.
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694.
- Rajasekhar, G.P.; Alam, J. Audio-Visual Speaker Verification via Joint Cross-Attention. In Speech and Computer; Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 14339, pp. 18–31. ISBN 978-3-031-48311-0.
- Chetty, G.; Wagner, M. Audiovisual Speaker Identity Verification Based on Lip Motion Features. In Proceedings of the Interspeech 2007, ISCA, Antwerp, Belgium, 27–31 August 2007; pp. 2045–2048.
- Liu, M.; Lee, K.A.; Wang, L.; Zhang, H.; Zeng, C.; Dang, J. Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Bando, Y.; Aizawa, T.; Itoyama, K.; Nakadai, K. Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022; pp. 3824–3828.
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1086–1090.
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6558–6569.
- Shi, B.; Hsu, W.-N.; Lakhotia, K.; Mohamed, A. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
- Ephrat, A.; Halperin, T.; Peleg, S. Improved Speech Reconstruction from Silent Video. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 455–462.
- Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Toulon, France, 24–26 April 2017.
- Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6738–6746.
- Yang, Z.; Wang, X.; Xia, D.; Wang, W.; Dai, H. Sequence-Based Device-Free Gesture Recognition Framework for Multi-Channel Acoustic Signals. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Jeon, S.; Lee, J.; Lee, Y.-J. Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition. AI 2025, 6, 222.
- Tao, R.; Das, R.K.; Li, H. Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 2242–2246.
- Shi, B.; Mohamed, A.; Hsu, W.-N. Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022; pp. 4785–4789.
- Graves, A.; Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Netw. 2005, 18, 602–610.
- Rajasekhar, G.P.; Alam, J. SSAVSV: Towards Unified Model for Self-Supervised Audio–Visual Speaker Verification. arXiv 2025, arXiv:2506.17694.
- Wu, Z.; Shen, C.; van den Hengel, A. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. Pattern Recognit. 2019, 90, 119–133.
- Sarı, L.; Singh, K.; Zhou, J.; Torresani, L.; Singhal, N.; Saraf, Y. A Multi-View Approach to Audio–Visual Speaker Verification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6194–6198.
- Ren, W.; Ma, L.; Zhang, J.; Pan, J.; Cao, X.; Liu, W.; Yang, M.-H. Gated Fusion Network for Single Image Dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3253–3261.
- Snyder, D.; Chen, G.; Povey, D. MUSAN: A Music, Speech, and Noise Corpus. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 348–352.
- Hu, C.; Hao, Y.; Zhang, F.; Luo, X.; Shen, Y.; Gao, Y.; Deng, C.; Zhang, S.; Feng, J. Privacy-Preserving Speaker Verification via End-to-End Secure Representation Learning. In Proceedings of the Interspeech 2025, Rotterdam, The Netherlands, 17–21 August 2025; pp. 1508–1512.



| Noise Type | SNR (dB) | EER (%) | minDCF |
|---|---|---|---|
| Babble | 5 | 4.01 | 0.44 |
| Babble | 0 | 5.17 | 0.56 |
| Speech | 5 | 4.78 | 0.43 |
| Speech | 0 | 4.98 | 0.54 |
| Music | 5 | 3.91 | 0.42 |
| Music | 0 | 4.85 | 0.52 |
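The noise-robustness results above are reported as EER and minDCF, both derived from a sweep over decision thresholds on the trial scores. As a reference for how these metrics are computed, here is a minimal numpy sketch; the threshold sweep and the NIST-style minDCF parameters (`p_target`, `c_miss`, `c_fa`) are illustrative assumptions, not the paper's exact evaluation tooling.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: operating point where the false-accept rate
    (impostor trials accepted) equals the false-reject rate (genuine
    trials rejected), found by sweeping candidate thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_diff, best_eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)  # false accepts
        frr = np.mean(target_scores < t)      # false rejects
        if abs(far - frr) < best_diff:
            best_diff, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all thresholds, normalized by the
    cost of the trivial always-accept / always-reject system."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    costs = []
    for t in thresholds:
        p_miss = np.mean(target_scores < t)
        p_fa = np.mean(nontarget_scores >= t)
        costs.append(c_miss * p_target * p_miss
                     + c_fa * (1 - p_target) * p_fa)
    return min(costs) / min(c_miss * p_target, c_fa * (1 - p_target))
```

With perfectly separated scores both metrics are zero; as target and impostor score distributions overlap, both rise toward their worst-case values.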

| System | Modality | Dataset | EER (%) | minDCF |
|---|---|---|---|---|
| SSAVSV [22] | Audio + Visual | VoxCeleb1 | 6.135 | 0.472 |
| VFNet [19] | Audio + Visual | VoxCeleb2 | 22.52 | — |
| AV-HuBERT (audio + lip) [20] | Audio + Lip | VoxCeleb2 | 3.7 | — |
| Ours (Proposed Model) | Audio + Lip | VoxCeleb2 | 3.419 | 0.342 |
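All compared systems are scored with cosine similarity between L2-normalized enrollment and test embeddings, the symmetric scoring back-end listed in the ablation variants below. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    """Cosine similarity between two speaker embeddings after
    L2 normalization; scores lie in [-1, 1], higher = same speaker."""
    a = emb_enroll / np.linalg.norm(emb_enroll)
    b = emb_test / np.linalg.norm(emb_test)
    return float(np.dot(a, b))
```

Because both embeddings are projected onto the unit hypersphere, the score depends only on angular separation, which is what margin-based objectives such as ArcFace optimize during training.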

| Variant | Symmetric Layers Kept | Asymmetric Layers Kept | EER (%) |
|---|---|---|---|
| (S0) Average fusion + statistics pooling (maximally symmetric baseline) | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, average fusion, cosine scoring | — | 5.765 |
| (S1) Residual fusion + statistics pooling | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, cosine scoring | BiLSTM, residual fusion | 4.860 |
| (S2) Gated fusion + statistics pooling | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, cosine scoring | BiLSTM, gated fusion | 3.652 |
| (S3) Full model (ours) | TDNN, ResNet, shared Conformer, L2-norm, cosine scoring | BiLSTM, gated fusion, MHA pooling | 3.419 |
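The gated fusion kept in variants S2 and S3 can be sketched as a single learned gate in the spirit of gated multimodal units (Arevalo et al.): the gate breaks the symmetry of simple averaging by letting the network weight each modality per dimension. The linear-gate form and parameter names below are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio_feat, video_feat, W_g, b_g):
    """Asymmetric gated fusion of two modality features of dimension d.
    W_g (shape d x 2d) and b_g (shape d) stand in for trained gate
    parameters; g in (0, 1) interpolates between the modalities."""
    concat = np.concatenate([audio_feat, video_feat])  # shape (2d,)
    g = sigmoid(W_g @ concat + b_g)                    # per-dimension gate
    return g * audio_feat + (1.0 - g) * video_feat
```

When the gate saturates at 1 the audio stream dominates; at 0 the video stream does; a gate fixed at 0.5 degenerates to the symmetric average-fusion baseline S0.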
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Thiyagarajan, S.; Kim, D.-H. Symmetry and Asymmetry Principles in Deep Speaker Verification Systems: Balancing Robustness and Discrimination Through Hybrid Neural Architectures. Symmetry 2026, 18, 121. https://doi.org/10.3390/sym18010121

