A User Recognition Methodology Based on Voice Biometrics and Dynamic Clustering for Social Robots
Abstract
1. Introduction
2. Related Work
2.1. Speaker Recognition Methods
2.2. Speaker Recognition in Social Robotics
3. User Recognition from Voice Biometrics
3.1. Materials and Tools
3.1.1. Voice Biometrics Extraction
- Vosk [44] is an open-source speech recognition toolkit designed for local execution without an internet connection. It offers various Automatic Speech Recognition models across multiple languages, and includes a lightweight (just 13 MB), text-dependent, and language-independent speaker identification module. Although performance metrics such as accuracy or Word Error Rate have not yet been formally reported, the toolkit is widely used for its efficiency [45,46,47,48]. The speaker model generates embeddings, called X-vectors [49], with a fixed dimensionality of 128 capturing voice features.
- SpeechBrain [50] is an open-source community toolkit for speech processing. It offers a wide variety of models tailored to different functionality requirements, such as Automatic Speech Recognition, emotion recognition, or speech enhancement to remove noise from audio recordings. Most can run on-device without a network connection. It includes text-independent models for speaker verification, recognition, and diarisation. The selected SpeechBrain models for speaker verification, which are pre-trained on the audio-visual dataset VoxCeleb (Visual Geometry Group, University of Oxford. “VoxCeleb: Large-Scale Speaker Identification” dataset, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ (accessed on 13 February 2026)), are:
  - Time Delay Neural Network (TDNN) (SpeechBrain “spkrec-xvect-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-xvect-voxceleb (accessed on 13 February 2026)) [51], which outputs embeddings called X-vectors [49] with 512 values to capture the voice biometrics. The model weighs 100–120 MB.
  - Time Delay Neural Network with Attention (ECAPA-TDNN) (SpeechBrain “spkrec-ecapa-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb (accessed on 13 February 2026)) [52], with a size of 50–60 MB and embeddings called ECAPA vectors [52] with 192 values for voice features.
  - Residual Neural Network (ResNet) (SpeechBrain “spkrec-resnet-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-resnet-voxceleb (accessed on 13 February 2026)) [53], with a size of 70–120 MB and embeddings of 256 values, called Residual vectors (r-vectors) [54].
- Resemblyzer [55] from Resemble AI [56]: Resemblyzer is a Python library (Python version 3.5 or above required) that extracts a speaker’s voice biometrics using a pre-trained Deep Learning model. The output vector of the model is a variation of deep vectors (d-vectors) [57], and its dimension is 256. The model is text-independent, runs locally without an internet connection, and weighs 30–40 MB.
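Because each model above produces fixed-length vectors, two utterances can be compared with a simple distance between their embeddings. The following is a minimal sketch, not the toolkits' own API: the `EMBEDDING_DIMS` table only restates the sizes listed above, and `cosine_distance` is a hypothetical helper written with the standard library. Embeddings from different models have different lengths and are assumed not to be comparable with each other.

```python
import math

# Embedding sizes as stated in the list above (values per vector).
EMBEDDING_DIMS = {
    "Vosk": 128,
    "SpeechBrain TDNN": 512,
    "SpeechBrain ECAPA-TDNN": 192,
    "SpeechBrain ResNet": 256,
    "Resemblyzer": 256,
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must come from the same model (same length)")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

A distance close to 0 means the two vectors point in nearly the same direction, while orthogonal vectors give a distance of 1.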
3.1.2. Clustering Algorithms
- Neighbourhood radius (ε): Defines the maximum distance between a point and a core point (the centre of a cluster) for the point to be considered a neighbour. ε must be above 0.
- Minimum points (Mp): The number of points necessary to form a new cluster. It has to be greater than or equal to 1.
- Metric: The distance function used to measure the similarity between points. The available functions are: “cosine”, “cityblock”, “euclidean”, “l1”, “l2”, “manhattan”, and “nan_euclidean”.
- Decaying factor (λ): Controls how fast the algorithm forgets old points. λ has to be greater than 0; the smaller this value, the more slowly it forgets.
- Weight threshold (β): Defines the weight threshold that distinguishes a potential micro-cluster from an outlier micro-cluster. β must be within the range (0, 1].
- Minimum weight (μ): The minimum weight required to consider a micro-cluster dense enough to be a core micro-cluster. μ has to be above 1/β, as βμ must be greater than 1.
- Micro-cluster radius (ε): The maximum radius of a micro-cluster. ε must be above 0.
- Initial number of samples: The number of points used in the initial phase to create the initial micro-cluster. It has to be greater than or equal to 1.
- Stream speed (v): The number of points processed per time unit. It must be greater than or equal to 1.
- Clustering threshold (r): Defines the maximum radius of micro-clusters. r has to be above 0.
- Fading factor (λ): Controls how fast old points lose importance. λ must be greater than 0, and the smaller its value, the more slowly old points are forgotten.
- Clean-up interval: Determines how frequently outliers or low-density micro-clusters are removed from memory. It must be above 0.
- Intersection factor: Threshold for merging two overlapping micro-clusters. It must be greater than 0.
- Minimum weight: The minimum weight a micro-cluster needs to be considered a stable micro-cluster. It should be greater than 1.
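The decaying and fading factors above can be illustrated with a short sketch. It assumes the exponential fading function 2^(−factor · elapsed time) used in the original DenStream and DBSTREAM descriptions [61,62]; specific library implementations may differ. The function names are hypothetical.

```python
def fading_weight(elapsed, factor):
    """Weight of a point that is `elapsed` time units old.

    Assumes the fading function 2 ** (-factor * elapsed), so a new point
    has weight 1 and older points decay towards 0.
    """
    if factor <= 0:
        raise ValueError("the decaying/fading factor must be greater than 0")
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return 2.0 ** (-factor * elapsed)


def half_life(factor):
    """Time after which a point keeps half of its weight under that function."""
    if factor <= 0:
        raise ValueError("the decaying/fading factor must be greater than 0")
    return 1.0 / factor
```

Under this assumption, a factor of 0.01 halves a point's weight after about 100 time units and a factor of 0.005 after about 200, which is why smaller values forget more slowly.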
3.2. Proposed Methodology
- The system must run in real time. This means user identification must be provided without delay, immediately after the user speaks to the robot.
- It has to work without human intervention, meaning it has to run autonomously and unsupervised.
- It must be incremental, meaning new embeddings are added without reprocessing the saved data.
- It has to be dynamic, allowing the addition of new users without manually pre-recording and preprocessing their information.
- It must run locally without an internet connection or access to an external server.
- The tools and algorithms employed must be as light as possible to run on computers with limited computational resources.
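To make the incremental and dynamic requirements concrete, here is a minimal sketch of an online assignment loop. It is not the methodology evaluated in this work, which relies on the clustering algorithms of Section 3.1.2; it is a simplified nearest-centroid scheme, and the class name, the `threshold` parameter, and the use of cosine distance are illustrative assumptions.

```python
import math


def _cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


class IncrementalSpeakerClusters:
    """Toy online clusterer: one running-mean centroid per presumed user."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical maximum distance to join a cluster
        self.centroids = []
        self.counts = []

    def assign(self, embedding):
        """Return a cluster id for `embedding`, creating a new cluster if needed."""
        best_id, best_dist = None, float("inf")
        for cluster_id, centroid in enumerate(self.centroids):
            dist = _cosine_distance(embedding, centroid)
            if dist < best_dist:
                best_id, best_dist = cluster_id, dist
        if best_id is None or best_dist > self.threshold:
            # Dynamic: an unseen voice opens a new cluster without any enrolment.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Incremental: update the running mean without revisiting stored samples.
        n = self.counts[best_id] + 1
        self.centroids[best_id] = [
            c + (x - c) / n for c, x in zip(self.centroids[best_id], embedding)
        ]
        self.counts[best_id] = n
        return best_id
```

Each call touches only the current centroids, so the cost per utterance grows with the number of clusters rather than with the number of stored embeddings.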
4. Methodology Evaluation
4.1. Participants for Offline Evaluation
4.2. Procedure for Offline Evaluation
4.3. Metrics for Offline Evaluation
- Model inference time: The time the voice biometric extraction model takes from the moment an audio sample is provided until the model returns its associated embedding, as shown in Figure 4a.
- Model total time: The time the model spends processing the entire database, from the moment the first audio sample is provided until the embedding of the last audio sample is returned, as shown in Figure 4b.
- External clustering metrics:
  - Adjusted Rand Index (ARI): ARI computes a similarity measure between the predicted clustering and the ground-truth labels based on pair counting, adjusting for chance. Its values range from negative values (less agreement than expected by chance) to 1 (perfect match), with 0 corresponding to random labelling [66,67].
  - Adjusted Mutual Information (AMI): AMI compares the predicted clustering with the ground-truth labels using information theory (mutual information), adjusting for chance. Its values are between 0 (random agreement) and 1 (perfect agreement) [68].
  - V-Measure: This metric is the harmonic mean of homogeneity and completeness. Homogeneity measures whether each cluster contains only objects from a single class, while completeness measures whether all objects of a given class are assigned to the same cluster. The V-Measure ranges from 0 to 1, where 1 stands for perfectly complete labelling [69].
- Internal clustering metrics:
  - Silhouette Coefficient: This coefficient provides a graphical display of the clusters’ silhouettes, showing how well the objects are classified within each cluster. The Silhouette Coefficient ranges from −1 to 1, where 1 is the best value and 0 means overlapping clusters [70].
- Custom evaluation metrics:
  - Clusters: The number of clusters created by the clustering algorithm to group all the embeddings.
  - Clustering latency: The amount of time the clustering algorithm takes to process all the embeddings and provide a result, as shown in Figure 4b.
  - Success rate: The percentage of samples that have been correctly clustered during the leave-one-out cross-validation tests.
  - Validation inference time: The time the clustering algorithm spends during the leave-one-out cross-validation test to cluster all embeddings for each training-validation dataset, as shown in Figure 4c.
  - Validation total time: The time the whole leave-one-out cross-validation test takes, as shown in Figure 4c.
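Two of the external metrics above can be computed directly from label counts. The sketch below is a from-scratch illustration using only the standard library, not the evaluation code used in this work: `adjusted_rand_index` follows the pair-counting form of Hubert and Arabie [66], and `v_measure` follows the conditional-entropy definition of Rosenberg and Hirschberg [69]. The function names and the handling of degenerate cases are assumptions.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """ARI by pair counting over the contingency table."""
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_truth * same_pred / total_pairs
    maximum = (same_truth + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (maximum - expected)


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), estimated from label counts."""
    n = len(target)
    given_sizes = Counter(given)
    joint = Counter(zip(target, given))
    return -sum(c / n * log(c / given_sizes[g]) for (_, g), c in joint.items())


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    h_truth, h_pred = _entropy(truth), _entropy(pred)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, pred) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Both functions ignore the actual label values: a clustering that matches the ground truth up to a renaming of cluster ids scores 1, while splitting one speaker across two clusters lowers completeness and therefore the V-Measure.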
4.4. Results of Offline Evaluation
4.4.1. Embedding Extraction Time Performance
4.4.2. Hyperparameter Optimisation for Each Embedding–Clustering Combination
4.4.3. Evaluation of New Samples Detection
5. Integration in a Pet-like Social Robot
5.1. Our Pet-like Social Robot
5.2. Participants for Online Evaluation
5.3. Procedure for Online Evaluation
5.4. Metrics for Online Evaluation
- Number of interactions: The number of times the system extracts an embedding from the audio captured and provides a cluster result when participants talk to the robot.
- Correct detections: It is the number of times the clustering algorithm returns the correct cluster given a new embedding. For known users, it refers to the cluster associated with that user. For unknown users, we consider a correct detection when the clustering algorithm creates a new cluster for the user and assigns the new embedding from this user to it. It is important to note that the system sometimes assigns different people to the same cluster when loading the robot’s database for the first time because it cannot distinguish between them. In this situation, we consider a correct detection if the cluster returned by the system for a new embedding matches the initial prediction. We also report the correct detections as a percentage.
- Incorrect detections: We consider an incorrect detection when the clustering algorithm returns the cluster of a different user than the one speaking. The cluster must be from an existing user, excluding noise detection. In addition, we report the incorrect detections as a percentage.
- Noise detection: This is the number of times the clustering algorithm classifies an embedding as noise, not providing a valid cluster number. We exclude from this counter the noise detections that are later treated as a new user’s cluster, as they are necessary to generate a profile for an unknown speaker. Additionally, the noise rate is reported as the percentage of noise detections.
- New cluster detections: It is the number of times a new cluster is detected across the interactions. For known speakers, the cluster must be different from the one associated with their IDs. For unknown speakers, we do not count the first cluster created for them. Additionally, we report the new cluster detection as a percentage.
- New clusters created: It refers to the number of new clusters created across the interactions for a given user. The appearance of a new cluster when an unknown speaker is talking to the robot is considered a good sign of recognising a new user, and its detection is considered a correct detection. Meanwhile, for a known speaker, it does not count as a correct or incorrect detection.
- Time to cluster database: The time the clustering algorithm spends loading the database of known users and obtaining its clusters, as shown in Figure 7.
- Embedding-to-cluster latency: The time from the moment the model produces an embedding until a cluster number is obtained, as shown in Figure 7.
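The count-based metrics above reduce to tallying one outcome per interaction. The following sketch assumes each interaction has already been labelled with exactly one of four hypothetical outcome names, and that percentages are taken over the total number of interactions; neither the labels nor the function come from the evaluated system.

```python
from collections import Counter

# Hypothetical outcome labels, one per interaction.
OUTCOMES = ("correct", "incorrect", "noise", "new_cluster")


def detection_rates(outcomes):
    """Return {outcome: (count, percentage of all interactions)}."""
    if not outcomes:
        raise ValueError("at least one interaction is required")
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: (counts[name], 100.0 * counts[name] / total) for name in OUTCOMES}
```

For example, a log with eight correct detections, one incorrect detection, and one noise detection yields 80%, 10%, and 10% respectively, with no new cluster detections.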
5.5. Results of Online Evaluation
6. Discussion
Limitations
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Phrases Used for Dataset Recording
- Today is a good day to go for a walk in the park and enjoy the sun, even though the weather is constantly changing.
- The penguin walks slowly along the beach while the dog barks in the garden.
- The blue house is near the river and has a huge garden full of colourful flowers that perfume the air in spring.
- Today the wind is blowing strongly, and clouds cover the entire sky, while the branches of the trees move as if they want to dance with the storm.
- The snowy mountains seem to glow in the light of dawn, and the valley’s silence is broken only by the distant song of birds.
- It is always good to share a meal with family and friends, because around the table, sincere conversations, happy memories and new smiles are born.
References
- Chan, J.; Nejat, G. Social Intelligence for a Robot Engaging People in Cognitive Training Activities. Int. J. Adv. Robot. Syst. 2012, 9, 51171.
- Donnermann, M.; Schaper, P.; Lugrin, B. Integrating a Social Robot in Higher Education—A Field Study. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN); IEEE: New York, NY, USA, 2020; pp. 573–579.
- Koh, W.Q.; Ang, F.X.H.; Casey, D. Impacts of Low-cost Robotic Pets for Older Adults and People With Dementia: Scoping Review. JMIR Rehabil. Assist. Technol. 2021, 8, e25340.
- Bharatharaj, J.; Huang, L.; Al-Jumaily, A.M. Bio-inspired therapeutic pet robots: Review and future direction. In 2015 10th International Conference on Information, Communications and Signal Processing (ICICS); IEEE: New York, NY, USA, 2015.
- Ruggiero, A.; Mahr, D.; Odekerken-Schröder, G.; Spena, T.R.; Mele, C. Companion robots for well-being: A review and relational framework. In Research Handbook on Services Management; Edward Elgar Publishing: Cheltenham, UK, 2022; pp. 309–330.
- Berridge, C.; Zhou, Y.; Robillard, J.M.; Kaye, J. Companion robots to mitigate loneliness among older adults: Perceptions of benefit and possible deception. Front. Psychol. 2023, 14, 1106633.
- Gasteiger, N.; Hellou, M.; Ahn, H.S. Factors for Personalization and Localization to Optimize Human–Robot Interaction: A Literature Review. Int. J. Soc. Robot. 2021, 15, 689–701.
- Di Napoli, C.; Ercolano, G.; Rossi, S. Personalized home-care support for the elderly: A field experience with a social robot at home. User Model. User-Adapt. Interact. 2023, 33, 405–440.
- Maroto-Gómez, M.; Alonso-Martín, F.; Malfaz, M.; Castro-González, Á.; Castillo, J.C.; Salichs, M.Á. A systematic literature review of decision-making and control systems for autonomous and social robots. Int. J. Soc. Robot. 2023, 15, 745–789.
- Maroto-Gómez, M.; Castro-González, Á.; Castillo, J.C.; Malfaz, M.; Salichs, M.Á. An adaptive decision-making system supported on user preference predictions for human–robot interactive communication. User Model. User-Adapt. Interact. 2023, 33, 359–403.
- Maroto-Gómez, M.; Lewis, M.; Castro-González, Á.; Malfaz, M.; Salichs, M.Á.; Cañamero, L. Adapting to my user, engaging with my robot: An adaptive affective architecture for a social assistive robot. ACM Trans. Intell. Syst. Technol. 2024, 15, 125.
- Arango, J.A.R.; Marco-Detchart, C.; Inglada, V.J.J. Personalized Cognitive Support via Social Robots. Sensors 2025, 25, 888.
- Jain, A.K.; Ross, A.; Prabhakar, S. An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 4–20.
- Yan, H.; Ang, M.H.; Poo, A. A Survey on Perception Methods for Human–Robot Interaction in Social Robots. Int. J. Soc. Robot. 2013, 6, 85–119.
- Yang, D.; Chae, Y.J.; Kim, D.; Lim, Y.; Kim, D.H.; Kim, C.; Park, S.K.; Nam, C. Effects of social behaviors of robots in privacy-sensitive situations. Int. J. Soc. Robot. 2022, 14, 589–602.
- Prabakaran, D.; Shyamala, R. A Review On Performance Of Voice Feature Extraction Techniques. In 2019 3rd International Conference on Computing and Convergence Technology; IEEE: New York, NY, USA, 2019.
- Kozhirbayev, Z.; Erol, B.A.; Sharipbay, A.; Jamshidi, M. Speaker recognition for robotic control via an iot device. In 2018 World Automation Congress (WAC); IEEE: New York, NY, USA, 2018; pp. 1–5.
- Tuasikal, D.A.A.; Fakhrurroja, H.; Machbub, C. Voice Activation Using Speaker Recognition for Controlling Humanoid Robot. In 2018 IEEE 8th International Conference on System Engineering and Technology (ICSET); IEEE: New York, NY, USA, 2018.
- Alonso-Martín, F.; Salichs, M.A. Integration of a voice generation systems in a social robot. Cybern. Syst. 2011, 42, 215–245.
- Foggia, P.; Greco, A.; Roberto, A.; Saggese, A.; Vento, M. Few-shot re-identification of the speaker by social robots. Auton. Robot. 2022, 47, 181–192.
- Amirgaliyev, B.; Mussabek, M.; Rakhimzhanova, T.; Zhumadillayeva, A. A review of machine learning and deep learning methods for person detection, tracking and identification, and face recognition with applications. Sensors 2025, 25, 1410.
- Wiskott, L.; Fellous, J.; Krüger, N.; von der Malsburg, C. Face recognition by elastic bunch graph matching. In Proceedings of International Conference on Image Processing; IEEE: New York, NY, USA, 1997.
- Wang, Y.; Shen, J.; Petridis, S.; Pantić, M. A real-time and unsupervised face re-identification system for human–robot interaction. Pattern Recognit. Lett. 2019, 128, 559–568.
- Khalifa, A.; Abdelrahman, A.A.; Strazdas, D.; Hintz, J.; Hempel, T.; Al-Hamadi, A. Face Recognition and Tracking Framework for Human–Robot Interaction. Appl. Sci. 2022, 12, 5568.
- Mekruksavanich, S.; Jitpattanakul, A. Biometric User Identification Based on Human Activity Recognition Using Wearable Sensors: An Experiment Using Deep Learning Models. Electronics 2021, 10, 308.
- Lu, Z.; Wang, R.; Zhou, H.; Dong, N.; Lv, H.; Yang, G. A Novel Gait Identity Recognition Method for Personalized Human–robot Collaboration in Industry 5.0. Chin. J. Mech. Eng. 2025, 38, 191.
- Álvarez-Aparicio, C.; Guerrero-Higueras, A.M.; González-Santamarta, M.Á.; Campazas-Vega, A.; Matellán, V.; Fernández-Llamas, C. Biometric recognition through gait analysis. Sci. Rep. 2022, 12, 14530.
- Al-Qaderi, M.; Rad, A. A Multi-Modal Person Recognition System for Social Robots. Appl. Sci. 2018, 8, 387.
- Freire-Obregón, D.; Rosales-Santana, K.; Marín-Reyes, P.A.; Peñate-Sánchez, A.; Lorenzo-Navarro, J.; Castrillón-Santana, M. Improving user verification in human–robot interaction from audio or image inputs through sample quality assessment. Pattern Recognit. Lett. 2021, 149, 179–184.
- Folorunso, C.; Asaolu, O.; Popoola, O. A review of voice-base person identification: State-of-the-art. Covenant J. Eng. Technol. 2019, 3, 36–57.
- Bai, Z.; Zhang, X.L.; Chen, J. Speaker recognition based on deep learning: An overview. Neural Netw. 2021, 140, 65–99.
- Campbell, J. Speaker recognition: A tutorial. Proc. IEEE 1997, 85, 1437–1462.
- Brydinskyi, V.; Khoma, Y.; Sabodashko, D.; Podpora, M.; Khoma, V.; Konovalov, A.; Kostiak, M. Comparison of Modern Deep Learning Models for Speaker Verification. Appl. Sci. 2024, 14, 1329.
- Faundez-Zanuy, M.; Monte-Moreno, E. State-of-the-art in speaker recognition. IEEE Aerosp. Electron. Syst. Mag. 2005, 20, 7–12.
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Shome, N.; Sarkar, A.; Ghosh, A.K.; Laskar, R.H.; Kashyap, R. Speaker Recognition through Deep Learning Techniques: A Comprehensive Review and Research Challenges. Period. Polytech. Electr. Eng. Comput. Sci. 2023, 67, 300–336.
- Steck, H.; Ekanadham, C.; Kallus, N. Is cosine-similarity of embeddings really about similarity? In WWW ’24: Companion Proceedings of the ACM Web Conference 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 887–890.
- Farrell, K.; Mammone, R.J.; Assaleh, K. Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Process. 1994, 2, 194–205.
- Bansé, D.; Doddington, G.R.; Garcia-Romero, D.; Godfrey, J.J.; Greenberg, C.S.; Martin, A.F.; McCree, A.; Przybocki, M.; Reynolds, D.A. Summary and initial results of the 2013–2014 speaker recognition i-vector machine learning challenge. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014; pp. 368–372.
- Kiani, K.; Baniasadi, A. Speaker Recognition System based on Identity Vector using T-SNE Visualization and Mean-shift Algorithm. In 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS); IEEE: New York, NY, USA, 2019.
- Sadjadi, S.O.; Kheyrkhah, T.; Tong, A.; Greenberg, C.S.; Reynolds, D.A.; Singer, E.; Mason, L.P.; Hernandez-Cordero, J. The 2016 NIST Speaker Recognition Evaluation. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017.
- Sadjadi, S.O.; Greenberg, C.S.; Singer, E.; Reynolds, D.A.; Mason, L.P.; Hernandez-Cordero, J. The 2019 NIST Audio-Visual Speaker Recognition Evaluation. In Proceedings of the Odyssey 2020, Tokyo, Japan, 1–5 November 2020; pp. 259–265.
- Sadjadi, S.O.; Greenberg, C.S.; Singer, E.; Mason, L.; Reynolds, D.A. The 2021 NIST Speaker Recognition Evaluation. In Proceedings of the Odyssey 2022: The Speaker and Language Recognition Workshop, Beijing, China, 28 June–1 July 2022; pp. 322–329.
- Alpha Cephei. Vosk Offline Speech Recognition API. 2025. Available online: https://alphacephei.com/vosk/ (accessed on 13 February 2026).
- Asha, C.; D’Souza, J.M. Voice-Controlled Object Pick and Place for Collaborative Robots Employing the ROS2 Framework. In 2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE); IEEE: New York, NY, USA, 2024; pp. 1–7.
- Lemaignan, S.; Cooper, S.; Ros, R.; Ferrini, L.; Andriella, A.; Irisarri, A. Open-source natural language processing on the pal robotics ari social robot. In HRI ’23: Companion of the 2023 ACM/IEEE International Conference on Human–Robot Interaction; Association for Computing Machinery: New York, NY, USA, 2023; pp. 907–908.
- Sikorski, P.; Yu, K.; Billadeau, L.; Esposito, F.; AliAkbarpour, H.; Babaias, M. Improving Robotic Arms Through Natural Language Processing, Computer Vision, and Edge Computing. In 2025 3rd International Conference on Mechatronics, Control and Robotics (ICMCR); IEEE: New York, NY, USA, 2025; pp. 35–41.
- Soni, A.A. Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit. arXiv 2025, arXiv:2503.21025.
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing; IEEE: New York, NY, USA, 2018.
- Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624.
- Snyder, D.; Garcia-Romero, D.; McCree, A.; Sell, G.; Povey, D.; Khudanpur, S. Spoken Language Recognition using X-vectors. In Proceedings of the Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018.
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the INTERSPEECH 2020: Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020.
- Villalba, J.; Chen, N.; Snyder, D.; Garcia-Romero, D.; McCree, A.V.; Sell, G.; Borgstrom, J.; García-Perera, L.P.; Richardson, F.; Dehak, R.; et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations. Comput. Speech Lang. 2020, 60, 101026.
- Zeinali, H.; Wang, S.; Silnova, A.; Matějka, P.; Plchot, O. BUT System Description to VoxCeleb Speaker Recognition Challenge. arXiv 2019, arXiv:1910.12592.
- Resemble AI. Resemblyzer: A Python Package to Analyze and Compare Voices with Deep Learning. 2019. Available online: https://github.com/resemble-ai/Resemblyzer (accessed on 13 February 2026).
- Resemble AI. Resemble AI: Generative Voice AI for Enterprise. 2024. Available online: https://www.resemble.ai/ (accessed on 13 February 2026).
- Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2018; pp. 4879–4883.
- Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996, 25, 103–114.
- Ester, M.; Kriegel, H.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial Databases with Noise. In KDD ’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; AAAI Press: Washington, DC, USA, 1996.
- Ester, M.; Kriegel, H.P.; Sander, J.; Wimmer, M.; Xu, X. Incremental Clustering for Mining in a Data Ware Housing; University of Munich Oettingenstr: Munich, Germany, 1998; Volume 67.
- Cao, F.; Ester, M.; Qian, W.; Zhou, A. Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the 2006 SIAM International Conference on Data Mining (SDM); SIAM: Philadelphia, PA, USA, 2006.
- Hahsler, M.; Bolaños, M. Clustering Data Streams Based on Shared Density between Micro-Clusters. IEEE Trans. Knowl. Data Eng. 2016, 28, 1449–1461.
- Milligan, S.; Sales, G.; Khirnykh, K. Sound levels in rooms housing laboratory animals: An uncontrolled daily variable. Physiol. Behav. 1993, 53, 1067–1076.
- Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
- Berrar, D. Cross-validation. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019.
- Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
- Steinley, D. Properties of the Hubert-Arabie adjusted Rand index. Psychol. Methods 2004, 9, 386–396.
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 2010, 11, 2837–2854.
- Rosenberg, A.; Hirschberg, J. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007.
- Rousseeuw, P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
- Cabezaolías, C.; de la Cruz Díaz, A.; Maroto-Gómez, M.; Castillo, J.C.; Salichs, M.Á. A Pet Robot Prototype for Animal-Assisted Therapy. In Advances in Practical Applications of Agents, Multi-Agent Systems, and Digital Twins: The PAAMS Collection; Springer: Cham, Switzerland, 2024; pp. 330–336.
- Bo, V.; Garrell, A.; Sanfeliu, A. Fast or Accurate? How Intention-Recognition Models Shape Human Perception of a Mobile Robot. In HRI ’26: Companion Proceedings of the 21st ACM/IEEE International Conference on Human–Robot Interaction; Association for Computing Machinery: New York, NY, USA, 2026; pp. 502–506.
- Waveren, S.V.; Carter, E.; Leite, I. Take One For the Team: The Effects of Error Severity in Collaborative Tasks with Social Robots. In IVA ’19: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents; Association for Computing Machinery: New York, NY, USA, 2019.
- Hancock, P.A.; Kessler, T.; Kaplan, A.D.; Brill, J.C.; Szalma, J.L. Evolving Trust in Robots: Specification Through Sequential and Comparative Meta-Analyses. Hum. Factors 2021, 63, 1196–1229.
- Campagna, G.; Rehm, M. A Systematic Review of Trust Assessments in Human–Robot Interaction. ACM Trans. Hum.-Robot. Interact. 2024, 14, 30.








| Reference | Advantages | Limitations |
|---|---|---|
| Alonso-Martín and Salichs [19] | User verification and speech extraction | No standardised performance evaluation |
| Kozhirbayev et al. [17] | Speaker recognition using Neural Networks and Mel-Frequency Cepstral Coefficients | Requires external online server and does not recognise unknown users |
| Tuasikal et al. [18] | User verification using Dynamic Time Warping and Mel-Frequency Cepstral Coefficients | Requires external online server |
| Foggia et al. [20] | Known and unknown user recognition | Requires embedded GPU (NVIDIA Jetson) |

| Algorithm | Hyperparameter | Values Range |
|---|---|---|
| BIRCH | Threshold (T) | 1.0–25.5, with step 0.5 |
| IncDBSCAN | Neighbourhood radius (ε) | 0.1–1.0, with step 0.1 |
| IncDBSCAN | Minimum points (Mp) | 1–10, with step 1 |
| IncDBSCAN | Metric | cosine |
| DenStream | Decaying factor (λ) | 0.005, 0.01 |
| DenStream | Weight threshold (β) | 0.1–0.7, with step 0.1 |
| DenStream | Minimum weight (μ) | 7–13, with step 1 |
| DenStream | Micro-cluster radius (ε) | 0.3–0.8, with step 0.1 |
| DenStream | Initial samples | 8–13, with step 1 |
| DenStream | Stream speed (v) | 3–7, with step 1 |
| DBSTREAM | Clustering threshold (r) | 0.5–7.0, with steps around 1.0 |
| DBSTREAM | Fading factor (λ) | 0.01, 0.001 |
| DBSTREAM | Clean-up interval | 2–6, with step 1 |
| DBSTREAM | Intersection factor | 0.1–1.0, with step 0.1 |
| DBSTREAM | Minimum weight | 0.1, 0.1–5, with step 1 |

| Voice Biometric Extraction Model | Model Inference Time (s) | Model Total Time (s) | Number of Embeddings |
|---|---|---|---|
| SpeechBrain TDNN | 0.094 | 45.321 | 480 |
| Resemblyzer | 0.145 | 69.790 | 480 |
| Vosk | 0.645 | 327.280 | 506 |
| SpeechBrain ECAPA-TDNN | 3.008 | 1479.945 | 480 |
| SpeechBrain ResNet | 28.448 | 13,654.823 | 480 |

| Model | Selected Hyperparameters |
|---|---|
| BIRCH + Vosk | |
| BIRCH + SpeechBrain ECAPA-TDNN | |
| BIRCH + SpeechBrain ResNet | |
| BIRCH + SpeechBrain TDNN | |
| BIRCH + Resemblyzer | |
| IncDBSCAN + Vosk | , , cosine |
| IncDBSCAN + SpeechBrain ECAPA-TDNN | , , cosine |
| IncDBSCAN + SpeechBrain ResNet | , , cosine |
| IncDBSCAN + SpeechBrain TDNN | , , cosine |
| IncDBSCAN + Resemblyzer | , , cosine |
| DenStream + Vosk | , , , , , |
| DenStream + SpeechBrain embeddings | , , , , , |
| DenStream + Resemblyzer | , , , , , |
| DBSTREAM + All embeddings | , , , , |

| Rank | Model | ARI | AMI | V-Measure | Silhouette | Clusters | Clustering Latency (s) |
|---|---|---|---|---|---|---|---|
| 1 | BIRCH + Vosk | 0.793 | 0.936 | 0.957 | 0.294 | 40 | 0.064 |
| 2 | IncDBSCAN + Resemblyzer | 0.824 | 0.923 | 0.952 | 0.475 | 44 | 0.301 |
| 3 | IncDBSCAN + Vosk | 0.774 | 0.921 | 0.947 | 0.309 | 42 | 0.278 |
| 4 | BIRCH + SpeechBrain ResNet | 0.722 | 0.870 | 0.918 | 0.247 | 47 | 0.065 |
| 5 | BIRCH + Resemblyzer | 0.519 | 0.827 | 0.884 | 0.311 | 41 | 0.065 |
| 6 | IncDBSCAN + SpeechBrain ResNet | 0.515 | 0.807 | 0.876 | 0.276 | 43 | 0.301 |
| 7 | IncDBSCAN + SpeechBrain TDNN | 0.393 | 0.770 | 0.843 | 0.237 | 35 | 0.358 |
| 8 | BIRCH + SpeechBrain TDNN | 0.331 | 0.734 | 0.819 | 0.202 | 41 | 0.055 |
| 9 | BIRCH + SpeechBrain ECAPA-TDNN | 0.325 | 0.689 | 0.784 | 0.190 | 35 | 0.085 |
| 10 | IncDBSCAN + SpeechBrain ECAPA-TDNN | 0.320 | 0.717 | 0.800 | 0.148 | 34 | 0.293 |
| 11 | DenStream + Vosk | 0.317 | 0.715 | 0.763 | 0.176 | 14 | 1.067 |
| 12 | DenStream + Resemblyzer | 0.227 | 0.671 | 0.728 | 0.158 | 13 | 3.341 |
| 13 | DenStream + SpeechBrain ResNet | 0.113 | 0.513 | 0.568 | 0.116 | 8 | 1.555 |
| 14 | DenStream + SpeechBrain ECAPA-TDNN | 0.105 | 0.436 | 0.476 | 0.109 | 5 | 0.499 |
| 15 | DenStream + SpeechBrain TDNN | 0.069 | 0.386 | 0.420 | 0.145 | 4 | 0.825 |
| 16 | DBSTREAM + Vosk | 0.028 | 0.189 | 0.205 | 0.096 | 2 | 0.075 |
| 17 | DBSTREAM + SpeechBrain ECAPA-TDNN | 0.026 | 0.184 | 0.201 | 0.129 | 2 | 0.107 |
| 18 | DBSTREAM + SpeechBrain TDNN | 0.026 | 0.171 | 0.187 | 0.161 | 2 | 0.263 |
| 19 | DBSTREAM + SpeechBrain ResNet | 0.022 | 0.169 | 0.186 | 0.087 | 2 | 0.135 |
| 20 | DBSTREAM + Resemblyzer | 0.016 | 0.131 | 0.149 | 0.051 | 2 | 0.139 |
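
For readers unfamiliar with the ARI column, the Adjusted Rand Index compares the pairs of samples that two labelings place together, corrected for chance agreement. The following is a minimal pure-Python sketch of the standard pair-counting formula, given only to make the metric concrete; it is not the evaluation code behind the table, and the function name is illustrative. scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between a reference labeling and a clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one sample: the labelings trivially agree

    # Contingency counts: samples sharing a (true label, predicted label) pair.
    joint = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of sample pairs grouped together by both, or by each labeling.
    same_both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in true_sizes.values())
    same_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = same_true * same_pred / total_pairs  # agreement expected by chance
    maximum = (same_true + same_pred) / 2           # upper bound on agreement
    if maximum == expected:
        return 1.0  # both labelings are trivial and identical in structure
    return (same_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Cluster identifiers are arbitrary, so a relabeled perfect clustering scores 1.0.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Splitting one speaker into two clusters lowers the score.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

A score of 1 means the clustering reproduces the reference speakers exactly, while values near 0 indicate agreement no better than chance.
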

| Rank | Model | ARI | AMI | V-Measure | Silhouette | Clusters | Validation Inference Time (s) | Validation Total Time (s) | Success Rate (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | IncDBSCAN + Vosk | 0.774 | 0.920 | 0.946 | 0.332 | 36.02 | 0.281 | 148.305 | 91.30 |
| 2 | BIRCH + Vosk | 0.789 | 0.935 | 0.956 | 0.294 | 39.95 | 0.073 | 53.167 | 84.98 |
| 3 | IncDBSCAN + Resemblyzer | 0.823 | 0.924 | 0.953 | 0.475 | 45.01 | 0.295 | 148.298 | 81.04 |
| 4 | IncDBSCAN + SpeechBrain ResNet | 0.515 | 0.808 | 0.877 | 0.277 | 43.99 | 0.289 | 145.226 | 72.50 |
| 5 | BIRCH + Resemblyzer | 0.518 | 0.827 | 0.884 | 0.311 | 40.99 | 0.073 | 50.579 | 64.58 |
| 6 | BIRCH + SpeechBrain ResNet | 0.724 | 0.871 | 0.919 | 0.248 | 47.49 | 0.072 | 50.459 | 52.29 |

| User | Number of Interactions | Correct Detections | Incorrect Detections | Noise Detections | New Cluster Detections | New Clusters Created |
|---|---|---|---|---|---|---|
| Known 1 | 21 | 2 (9.52%) | 1 (4.76%) | 2 (9.52%) | 16 (76.19%) | 1 |
| Known 2 | 26 | 24 (92.31%) | 0 (0%) | 2 (7.69%) | - | 0 |
| Known 3 | 20 | 11 (55%) | 0 (0%) | 9 (45%) | - | 0 |
| Known 4 | 26 | 25 (96.15%) | 0 (0%) | 1 (3.85%) | - | 0 |
| Known 5 | 23 | 23 (100%) | 0 (0%) | 0 (0%) | - | 0 |
| Total Known | 116 | 85 (73.28%) | 1 (0.86%) | 14 (12.07%) | 16 (13.79%) | 1 |
| Unknown 1 | 21 | 21 (100%) | 0 (0%) | 0 (0%) | - | 1 |
| Unknown 2 | 19 | 19 (100%) | 0 (0%) | 0 (0%) | - | 1 |
| Unknown 3 | 26 | 12 (46.15%) | 0 (0%) | 14 (53.85%) | - | 1 |
| Unknown 4 | 14 | 14 (100%) | 0 (0%) | 0 (0%) | - | 1 |
| Unknown 5 | 27 | 10 (37.04%) | 0 (0%) | 10 (37.04%) | 7 (25.93%) | 2 |
| Total Unknown | 107 | 76 (71.03%) | 0 (0%) | 24 (22.43%) | 7 (6.54%) | 6 |
| Total | 223 | 161 (72.20%) | 1 (0.45%) | 38 (17.04%) | 23 (10.31%) | 7 |
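
The totals and percentages in the per-user table above follow directly from the per-user counts. The short sketch below recomputes them as a consistency check; it assumes that a "-" in the New Cluster Detections column means zero, and the names `KNOWN`, `UNKNOWN`, `totals`, and `percentage` are illustrative.

```python
# Counts transcribed from the per-user table above, in the order:
# (interactions, correct, incorrect, noise, new-cluster detections).
# Assumption: "-" in the New Cluster Detections column is read as 0.
KNOWN = {
    "Known 1": (21, 2, 1, 2, 16),
    "Known 2": (26, 24, 0, 2, 0),
    "Known 3": (20, 11, 0, 9, 0),
    "Known 4": (26, 25, 0, 1, 0),
    "Known 5": (23, 23, 0, 0, 0),
}
UNKNOWN = {
    "Unknown 1": (21, 21, 0, 0, 0),
    "Unknown 2": (19, 19, 0, 0, 0),
    "Unknown 3": (26, 12, 0, 14, 0),
    "Unknown 4": (14, 14, 0, 0, 0),
    "Unknown 5": (27, 10, 0, 10, 7),
}


def totals(rows):
    """Column-wise sums over all users in a group."""
    return tuple(sum(column) for column in zip(*rows.values()))


def percentage(part, whole):
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(100 * part / whole, 2)


if __name__ == "__main__":
    known, unknown = totals(KNOWN), totals(UNKNOWN)
    overall = tuple(k + u for k, u in zip(known, unknown))
    for label, row in (("Known", known), ("Unknown", unknown), ("All", overall)):
        interactions, correct, incorrect, noise, new_cluster = row
        print(label, interactions,
              percentage(correct, interactions),
              percentage(incorrect, interactions),
              percentage(noise, interactions),
              percentage(new_cluster, interactions))
```

In every row the four outcome counts sum to the number of interactions, so the outcome categories are mutually exclusive.
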
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Segura-Bencomo, A.; Maroto-Gómez, M.; Gamboa-Montero, J.J.; Castillo, J.C. A User Recognition Methodology Based on Voice Biometrics and Dynamic Clustering for Social Robots. Appl. Sci. 2026, 16, 4548. https://doi.org/10.3390/app16094548

