Search Results (4)

Search Parameters:
Keywords = audio-visual speaker verification

15 pages, 1386 KB  
Article
Symmetry and Asymmetry Principles in Deep Speaker Verification Systems: Balancing Robustness and Discrimination Through Hybrid Neural Architectures
by Sundareswari Thiyagarajan and Deok-Hwan Kim
Symmetry 2026, 18(1), 121; https://doi.org/10.3390/sym18010121 - 8 Jan 2026
Viewed by 476
Abstract
Symmetry and asymmetry are foundational design principles in artificial intelligence, defining the balance between invariance and adaptability in multimodal learning systems. In audio-visual speaker verification, where speech and lip-motion features are jointly modeled to determine whether two utterances belong to the same individual, these principles govern both fairness and discriminative power. In this work, we analyze how symmetry and asymmetry emerge within a gated-fusion architecture that integrates Time-Delay Neural Network and Bidirectional Long Short-Term Memory encoders for speech, ResNet-based visual lip encoders, and a shared Conformer-based temporal backbone. Structural symmetry is preserved through weight-sharing across paired utterances and symmetric cosine-based scoring, ensuring verification consistency regardless of input order. In contrast, asymmetry is intentionally introduced through modality-dependent temporal encoding, multi-head attention pooling, and a learnable gating mechanism that dynamically re-weights the contribution of the audio and visual streams at each timestep. This controlled asymmetry allows the model to rely on visual cues when speech is noisy and, conversely, on speech when lip visibility is degraded, yielding adaptive robustness under cross-modal degradation. Experimental results demonstrate that combining a symmetric embedding-space design with adaptive asymmetric fusion significantly improves generalization, reducing the Equal Error Rate (EER) to 3.419% on the VoxCeleb-2 test set without sacrificing interpretability. The findings show that symmetry ensures stable and fair decision-making, while learnable asymmetry enables modality awareness, together forming a principled foundation for next-generation audio-visual speaker verification systems.
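As a rough illustration of the per-timestep gated fusion and order-invariant scoring this abstract describes, the sketch below shows how a learnable gate might re-weight aligned audio and visual feature streams, and how a symmetric cosine score keeps the verification decision independent of which utterance comes first. This is not the authors' implementation; the module names, dimensions, and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAVFusion(nn.Module):
    """Sketch: per-timestep gated fusion of audio and visual features (hypothetical)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # The gate sees both modalities and emits one weight in (0, 1) per timestep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim), assumed already aligned in time.
        g = torch.sigmoid(self.gate(torch.cat([audio, visual], dim=-1)))  # (batch, time, 1)
        # Data-dependent asymmetry: lean on visual cues where the audio is unreliable.
        return g * audio + (1.0 - g) * visual

def verification_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    # Symmetric cosine scoring: score(a, b) == score(b, a) by construction.
    return F.cosine_similarity(emb_a, emb_b, dim=-1)

# Toy usage with random tensors standing in for encoder outputs.
fusion = GatedAVFusion(dim=256)
audio_feats = torch.randn(2, 100, 256)   # stand-in for TDNN/BiLSTM speech features
visual_feats = torch.randn(2, 100, 256)  # stand-in for ResNet lip features
fused = fusion(audio_feats, visual_feats)   # (2, 100, 256)
embeddings = fused.mean(dim=1)              # mean pooling as a placeholder for attention pooling
print(verification_score(embeddings[0], embeddings[1]))
```

In the paper's actual pipeline the fused sequence would pass through the Conformer backbone and multi-head attention pooling before scoring; the mean pooling above is only a placeholder.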

16 pages, 414 KB  
Article
Audio–Visual Fusion Based on Interactive Attention for Person Verification
by Xuebin Jing, Liang He, Zhida Song and Shaolei Wang
Sensors 2023, 23(24), 9845; https://doi.org/10.3390/s23249845 - 15 Dec 2023
Cited by 1 | Viewed by 3346
Abstract
With the rapid development of multimedia technology, person verification systems have become increasingly important in security and identity verification. However, unimodal verification systems hit performance bottlenecks in complex scenarios, which motivates multimodal feature fusion methods. The main problem in audio–visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the system's accuracy and robustness in verifying individual identity. In this paper, we focus on improving multimodal person verification by combining audio and visual features: we use pretrained models to extract the embeddings from each modality and then perform fusion-model experiments based on these embeddings. The baseline approach passes the fused feature through a fully connected (FC) layer. Building upon this baseline, we propose three fusion models based on attentional mechanisms: attention, gated, and inter-attention. These fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On the VoxCeleb1 dataset, the best system performance achieved in this study was an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011. On the NIST SRE19 evaluation set, the EER was 2.60% and the minDCF was 0.283. On the CNC-AV evaluation set, the EER was 11.30% and the minDCF was 0.443. These experimental results demonstrate that the proposed fusion method can significantly improve the performance of multimodal person verification systems.
(This article belongs to the Section Intelligent Sensors)
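To make the embedding-level fusion idea concrete, here is a minimal sketch of a gate-weighted combination of two pretrained embeddings followed by a fully connected projection, in the spirit of the baseline-plus-gated-fusion setup the abstract outlines. It is not the paper's architecture; the embedding dimensions, layer names, and projection size are assumptions.

```python
import torch
import torch.nn as nn

class GatedEmbeddingFusion(nn.Module):
    """Sketch: gated fusion of pretrained speaker and face embeddings (hypothetical dims)."""
    def __init__(self, audio_dim: int = 192, face_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, out_dim)
        self.face_proj = nn.Linear(face_dim, out_dim)
        # The gate decides, per trial, how much to trust each modality.
        self.gate = nn.Linear(2 * out_dim, out_dim)
        self.fc = nn.Linear(out_dim, out_dim)  # FC layer on the fused feature, as in the baseline idea

    def forward(self, audio_emb: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
        a = self.audio_proj(audio_emb)
        v = self.face_proj(face_emb)
        g = torch.sigmoid(self.gate(torch.cat([a, v], dim=-1)))
        return self.fc(g * a + (1.0 - g) * v)

# Toy usage: random vectors standing in for pretrained audio and face embeddings.
model = GatedEmbeddingFusion()
person_emb = model(torch.randn(4, 192), torch.randn(4, 512))  # (4, 256)
```

The attention and inter-attention variants mentioned in the abstract would presumably compute the modality weights with attention rather than a single learned gate, keeping the same input and output shapes.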

23 pages, 4407 KB  
Article
Empirical Comparison between Deep and Classical Classifiers for Speaker Verification in Emotional Talking Environments
by Ali Bou Nassif, Ismail Shahin, Mohammed Lataifeh, Ashraf Elnagar and Nawel Nemmour
Information 2022, 13(10), 456; https://doi.org/10.3390/info13100456 - 27 Sep 2022
Cited by 4 | Viewed by 2891
Abstract
Speech signals carry various bits of information relevant to the speaker, such as age, gender, accent, language, health, and emotions. Emotions are conveyed through modulations of facial and vocal expressions. This paper conducts an empirical performance comparison between classical classifiers, namely the Gaussian Mixture Model (GMM), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Artificial Neural Networks (ANN), and deep learning classifiers, i.e., Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU), in addition to the i-vector approach, for a text-independent speaker verification task in neutral and emotional talking environments. The deep models undergo hyperparameter tuning using the Grid Search optimization algorithm. The models are trained and tested on a private Arabic Emirati Speech Database, the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS), and the public Crowd-Sourced Emotional Multimodal Actors (CREMA) database. Evaluation was carried out using the Equal Error Rate (EER) along with Area Under the Curve (AUC) scores, and the experimental results illustrate that deep architectures do not necessarily outperform classical classifiers. The findings reveal that, among the classical classifiers, the GMM yields the lowest EER values and the best AUC scores across all datasets. In addition, the i-vector model surpasses all the fine-tuned deep models (CNN, LSTM, and GRU) on both evaluation metrics in neutral as well as emotional speech. Moreover, the GMM outperforms the i-vector on the Emirati and RAVDESS databases.
(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)
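Both classifier families in this comparison are scored with EER and AUC. As a small hedged sketch (not the paper's code), the snippet below shows one common way to compute both metrics from a set of verification trial scores; the scores and labels are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false-accept and false-reject rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold index where the two rates cross
    return float((fpr[idx] + fnr[idx]) / 2.0)  # average them at that point

# Toy trial list: 1 = target (same speaker), 0 = impostor (different speaker).
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.3, 0.2, 0.7, 0.4])
print("EER:", equal_error_rate(labels, scores))
print("AUC:", roc_auc_score(labels, scores))
```

The same metric code applies whether the scores come from a GMM, an SVM, an i-vector back-end, or a fine-tuned deep model, which is what makes EER and AUC suitable for the cross-paradigm comparison the paper performs.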

15 pages, 559 KB  
Article
Multimodal Diarization Systems by Training Enrollment Models as Identity Representations
by Victoria Mingote, Ignacio Viñals, Pablo Gimeno, Antonio Miguel, Alfonso Ortega and Eduardo Lleida
Appl. Sci. 2022, 12(3), 1141; https://doi.org/10.3390/app12031141 - 21 Jan 2022
Viewed by 2882
Abstract
This paper describes a post-evaluation analysis of the system developed by the ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. The challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment in which a person is detected. In this work, we implemented two different subsystems to address this task, using the audio and the video from the audiovisual files separately. To develop our subsystems, we used state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNNs). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem that trains an enrollment model for each identity, which we have previously shown to improve results compared to averaging the enrollment data. Using this approach, we trained a learnable vector to represent each enrolled identity. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF), inspired by the DCF, a metric widely used to measure performance in verification tasks. We also explored and analyzed the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the loss configuration parameters on the number and types of errors produced by the system.
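To make the enrollment-vector training concrete, here is a hedged sketch of a sigmoid-approximated detection-cost loss applied to a learnable enrollment embedding scored against target and non-target test embeddings. The cost weights, target prior, sigmoid steepness, dimensions, and the cosine scoring are illustrative assumptions, not the ViVoLAB system's actual settings.

```python
import torch
import torch.nn.functional as F

def adcf_loss(target_scores, nontarget_scores, threshold=0.0,
              c_miss=1.0, c_fa=1.0, p_target=0.05, alpha=10.0):
    """Approximated detection cost: soft miss/false-alarm rates via a steep sigmoid."""
    p_miss = torch.sigmoid(alpha * (threshold - target_scores)).mean()
    p_fa = torch.sigmoid(alpha * (nontarget_scores - threshold)).mean()
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Toy setup: one learnable enrollment vector per identity, random stand-in test embeddings.
enroll = torch.randn(256, requires_grad=True)
optimizer = torch.optim.SGD([enroll], lr=0.1)
target_embs = torch.randn(20, 256)      # embeddings of the enrolled identity (stand-in data)
nontarget_embs = torch.randn(50, 256)   # embeddings of other identities (stand-in data)

for _ in range(100):
    optimizer.zero_grad()
    tgt = F.cosine_similarity(enroll.unsqueeze(0), target_embs, dim=-1)
    non = F.cosine_similarity(enroll.unsqueeze(0), nontarget_embs, dim=-1)
    loss = adcf_loss(tgt, non)
    loss.backward()
    optimizer.step()
```

Because every term is differentiable, the gradient pushes target scores above the threshold and non-target scores below it, weighted by the miss and false-alarm costs, which mirrors the trade-off an aDCF-style objective is meant to capture.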