A User Recognition Methodology Based on Voice Biometrics and Dynamic Clustering for Social Robots

Segura-Bencomo, Arecia; Maroto-Gómez, Marcos; Gamboa-Montero, Juan José; Castillo, José Carlos

doi:10.3390/app16094548

by Arecia Segura-Bencomo †, Marcos Maroto-Gómez *,†, Juan José Gamboa-Montero † and José Carlos Castillo †

RoboticsLab, Systems Engineering and Automation Department, Universidad Carlos III de Madrid, 28911 Leganés, Madrid, Spain

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2026, 16(9), 4548; https://doi.org/10.3390/app16094548

Submission received: 6 March 2026 / Revised: 22 April 2026 / Accepted: 30 April 2026 / Published: 5 May 2026

(This article belongs to the Section Robotics and Automation)

Abstract

Social robots are systems designed to assist people across different fields. During their operation, they have to interact with people with different characteristics and necessities. Consequently, correctly recognising the user interacting with the robot facilitates the generation of a personalised experience that satisfies the user’s needs. In robotics, user recognition is typically based on face recognition from image processing and datasets that require retraining the network to include new users. However, some robots, such as pet-like companions, often lack a camera due to reduced dimensions, limited computational resources, or privacy constraints. Additionally, robots can occasionally encounter new users, requiring online recognition to provide a personalised interaction experience. To address these limitations, this article presents a user recognition system based on voice biometrics and dynamic clustering for adaptive social robots. We evaluate a set of open-source models for voice biometric extraction using different clustering algorithms to identify the best combination for our application. The resulting system is implemented in a pet-like robot companion that is used for the affective support of older adults, demonstrating its capacities in a real-world scenario. The system achieves more than 73% accuracy in recognising users who had previously spoken to the robot and more than 71% success in recognising new users who had not previously interacted with the robot and creating a personal profile for them. However, the system still detects noise, especially when the speaker has never interacted with the robot.

Keywords:

speaker recognition; user recognition; voice biometrics; dynamic clustering; social robots; human–robot interaction; pet-like robot companion

1. Introduction

Social robots were created to interact with people and assist them in numerous tasks, such as cognitive stimulation exercises [1] and education [2]. In recent years, some of these robots have also been employed as pet companions, with an animal-like appearance and biologically inspired behaviour, seeking the same health benefits that real pets offer us [3]. These robots have demonstrated positive health benefits in various human–robot interaction studies [4,5,6], including stress reduction, loneliness reduction, and depression alleviation. These benefits are especially relevant for the older population, where people tend to be socially isolated or suffer from impairing conditions such as dementia.

Some studies [7,8,9,10,11,12] reveal that social robots with adaptive behaviours tailored to users’ preferences, needs, and characteristics improve users’ perceptions of the robot and their interaction experience. To achieve such personalisation, the robot requires robust user recognition, which is defined as the ability to determine or confirm the identity of an individual [13]. Thus, the robot can differentiate between users and select the most appropriate behaviour accordingly. The literature [14] highlights that most social robots that can recognise users use cameras and integrate face recognition models trained with predefined datasets.

However, the use of cameras brings some important limitations. First, users may be concerned about their privacy when they are recorded, which can affect their willingness to use the robot [15]. Second, dynamic and inconsistent lighting conditions may affect image quality and face recognition algorithms. Finally, some robots, such as pet-like robots, often lack a camera due to their simplified perception systems, which result from their limited computational resources and reduced dimensions. To solve this issue, a promising alternative to camera recognition is the use of voice biometric extraction models to differentiate users based on their voice features.

Recent advances in voice biometrics extraction and speaker recognition [16] demonstrate good recognition rates in different environments. Taking advantage of these developments and the need for social robots to adapt to their users, this paper presents a user recognition method for social robots based on voice biometrics. The method works unsupervised, in real time, without preprocessing or offline training, runs locally, and incrementally detects and remembers users the robot has not previously encountered. Therefore, the main contributions of this research are: (i) exploring and comparing state-of-the-art open-source models for extracting the speaker’s biometrics (selected based on availability, documentation quality, active maintenance, and compatibility with limited computational resources) to find the most suitable for this application, (ii) developing an unsupervised clustering methodology for user recognition that is incremental (capable of adding new samples without retraining) and dynamic (capable of creating new clusters for new speakers), and (iii) integrating this method into a real pet-like robot to confirm its feasibility on constrained devices and obtain a first assessment of the method’s performance under real-world conditions. These contributions stand out from other related works that achieve speaker recognition on robotic platforms but are not dynamic [17,18] or are not applicable to resource-constrained robots due to their high computational cost [19,20].

This paper continues in Section 2 by analysing the related literature and emphasising our contribution. Section 3 describes the proposed method for user recognition based on the speaker’s voice features. Section 4 explains the experimental offline procedure followed to determine the best model and algorithm for our application. Section 5 describes the implementation on a pet-like robot companion and the results in recognising known and unknown speakers for an online evaluation. Section 6 discusses the main findings of the method and presents the limitations of the proposed method. Finally, Section 7 summarises the paper and outlines future work.

2. Related Work

User recognition is a crucial capability for social robots to adapt their interactions and behaviours to each individual [8]. Historically, this has been addressed through various biometric modalities, sometimes by fusing information from multiple sensors to increase robustness [21]. Computer vision for face recognition has become a standard in user recognition, evolving from early identification models [22] to modern unsupervised, real-time systems [23] and tracking-enhanced architectures [24]. As an alternative to facial data, behavioural biometrics have emerged, employing either wearable inertial sensors [25,26] or vision-based gait analysis [27] to identify users through activity patterns. Furthermore, multimodal systems have sought to achieve maximum accuracy by fusing face, voice, and body shape data [28], demonstrating that combining biometric channels yields more robust results than single-modality approaches [29]. However, as discussed in Section 1, the specific constraints of robots with hardware limitations, such as pet-like robots, and privacy requirements, justify a focused shift towards voice-based recognition.

2.1. Speaker Recognition Methods

Voice biometrics have been used to detect and identify users in many contexts, such as virtual assistants and security access [30]. The literature defines different modalities depending on the task: verification, which produces a true or false result depending on the match between the speaker and a previously saved identity; identification, which detects the closest identity of the speaker matching a user from a previously trained dataset; diarisation, which consists of answering the question “who spoke when”, splitting the speech from each speaker and identifying each one of them; and recognition, which performs identification and verification consecutively, detecting the closest identity to the speaker and verifying if both match [31].

Commonly, the speaker recognition process comprises two steps: extracting voice features and classifying them to identify and verify the speaker [32]. In computer science, voice features are typically represented as vectors known as embeddings, whose dimensions and specific characteristics vary depending on the underlying model architecture. Once a model is selected, these embeddings maintain a constant dimensionality across different speakers, utilising their internal numerical values to encapsulate the unique voice biometrics of each individual [33]. Models can be text-dependent or text-independent, depending on whether they need to know the speech to extract embeddings [34]. In the context of embedding extraction, Prabakaran et al. [16] explored text-independent mathematical tools for voice feature extraction. The most commonly used tool is the Mel-Frequency Cepstral Coefficients [35], based on the Fourier Transform, which converts the audio signal into the frequency domain using a Mel-frequency filter bank and represents it as a vector of cepstral coefficients. Later, Shome et al. [36] reviewed Deep Learning techniques for speaker recognition, their applications, performance, and challenges. The authors concluded that these models are easily accessible and do not require high-performance software or hardware. However, they pose challenges due to microphone and language mismatches, short utterances, and background noise, all of which significantly impact their performance. Recently, Brydinskyi et al. [33] compared the latest Deep Learning models for embedding extraction and used cosine similarity to verify users, achieving the best results with the TitaNet and ECAPA models, due to their lower error rates of 1.91% and 1.71%. The model with the best balance of time and performance was ECAPA, with an inference time of 69.43 ms.

Comparing embeddings to verify users’ identities can be achieved through different methodologies, such as cosine similarity, clustering, and Neural Networks, depending on whether the system is supervised or unsupervised [37]. In this line, Farrell et al. [38] analysed some of these text-independent techniques, including Vector Quantisation (a clustering method), Decision Trees, and Neural Tree Networks. The study showed a decline in accuracy as the number of speakers increased, with the Neural Tree Network performing slightly better than the others but requiring more computational resources and time to compare vectors.
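
As an illustration of the cosine-similarity approach mentioned above, the following minimal sketch performs identification (closest enrolled identity) followed by verification (a decision threshold). The function names, the toy three-dimensional embeddings, and the threshold of 0.75 are assumptions chosen for the example and are not taken from any of the cited works.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognise(embedding, enrolled, threshold=0.75):
    """Identify the closest enrolled identity, then verify it against a threshold.

    `enrolled` maps a user ID to one reference embedding (hypothetical layout).
    Returns (user ID or None, best similarity score).
    """
    if not enrolled:
        return None, 0.0
    best_id, best_score = max(
        ((uid, cosine_similarity(embedding, ref)) for uid, ref in enrolled.items()),
        key=lambda item: item[1],
    )
    return (best_id if best_score >= threshold else None), best_score


# Toy embeddings; real models produce vectors with hundreds of values.
enrolled = {"user_a": [1.0, 0.0, 0.0], "user_b": [0.0, 1.0, 0.0]}
known = recognise([0.9, 0.1, 0.0], enrolled)
unknown = recognise([0.0, 0.0, 1.0], enrolled)
```

In this sketch, a sample whose best score falls below the threshold is rejected rather than forced onto the closest identity, which is the behaviour that distinguishes recognition from plain identification.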

The National Institute of Standards and Technology (NIST) periodically organises the Speaker Recognition Evaluation (SRE) to benchmark state-of-the-art speaker recognition systems and drive research progress. In the 2014 evaluation, all approaches relied on identity vectors (i-vectors) extracted with Gaussian Mixture Models, achieving improvements over the baseline minimum detection cost function (minDCF) of up to 23% [39]. Kiani et al. [40] later increased this improvement to 25% by applying t-Distributed Stochastic Neighbour Embedding (t-SNE) for dimensionality reduction, followed by Mean-Shift clustering to estimate speaker identities. The system then compared test embeddings with the cluster representatives using cosine similarity and a decision threshold to accept or reject the claimed identity.

The latest NIST evaluations have increasingly focused on deep neural speaker embeddings and more challenging real-world conditions. For instance, SRE18 [41] introduced new domains, such as Voice over Internet Protocol, audio from online videos and additional languages, to increase the variability and realism of the evaluation protocol. The following SRE19 [42] incorporated audio-visual person recognition and demonstrated significant performance gains through neural architectures and multimodal fusion. More recent evaluations, such as SRE21 [43] and the ongoing SRE24, continue this trend by focusing on cross-domain, cross-lingual, and multimodal scenarios, aiming to measure state-of-the-art systems under realistic conversational and video-based conditions.

2.2. Speaker Recognition in Social Robotics

Most research methods in speaker recognition are tested offline (i.e., not in real time), as reviewed in the previous sections. However, in social robotics, the recognition system must operate in real time to ensure a natural interaction with users. Some works have addressed this problem to achieve user personalisation. For example, Alonso-Martín and Salichs [19] proposed a speaker recognition method as part of a more extensive voice processing pipeline for human–robot interaction in the social robot Maggie. The system was specifically designed to restrict robot control by denying access to unknown users through the Loquendo Automatic Speech Recognition engine; notably, this text-dependent library enabled both user verification and speech extraction, though the study focused primarily on its practical implementation rather than providing standardised performance metrics.

Kozhirbayev et al. [17] developed a system using Mel-Frequency Cepstral Coefficients for embedding extraction and trained a Neural Network to predict the speaker from a database. They implemented this system on a humanoid robot, supported by a digital assistant (IoT device), to restrict command access to known users only. Tuasikal et al. [18] also used Mel-Frequency Cepstral Coefficients, but combined with Dynamic Time Warping to compute the distance between two signals for user verification to control a humanoid robot. Both studies relied on an external online server for audio processing.

Foggia et al. [20] implemented a speaker recogniser on a Pepper robot using an NVIDIA Jetson embedded system to avoid the limitations of cloud processing. They used a Neural Network based on ResNet34 to extract the voice features, and cosine similarity with a threshold to verify the speaker. Additionally, this system could store information from unknown users for later use. This approach aligns with our objective, but our system must run on a low-resource robot and achieve high recognition rates for new users not included in the original database.

To sum up, Table 1 shows the advantages and limitations of each previous speaker recognition system implemented on a social robot. While aiming to outperform existing methodologies, our approach is designed to operate under strict operational constraints: it eliminates the need for a pre-training stage, operates in real time, and incorporates an incremental learning capacity to identify and manage new speakers previously unknown to the robot. In addition, our methodology runs on systems with low computational resources without requiring a connection to an external server or an additional embedded system. Moreover, we do not need to save the user’s name as an identification label, since our pet-like robot does not speak and does not need to call users by their names.

3. User Recognition from Voice Biometrics

This section introduces the theoretical background and tools for this research, including voice biometric extraction models and clustering algorithms, which are necessary for developing our methodology. Then, the proposed methodology is presented.

3.1. Materials and Tools

Speaker recognition depends on two important processes: extracting voice biometrics and processing voice data to identify and verify the user. Next, we explain the selected voice biometrics extraction models and their characteristics, followed by the clustering algorithms used to process and recognise users.

3.1.1. Voice Biometrics Extraction

The voice biometric extraction process returns embeddings, whose type and characteristics depend on the model used. The model selected for embeddings extraction must be balanced in terms of accuracy, real-time processing, speed, and computational load. Therefore, various state-of-the-art open-source extraction models have been compared, taking these constraints into account.

Vosk [44] is an open-source speech recognition toolkit designed for local execution without an internet connection. It offers various Automatic Speech Recognition models across multiple languages, and includes a lightweight (just 13 MB), text-dependent, and language-independent speaker identification module. Although performance metrics such as accuracy or Word Error Rate have not yet been formally reported, the toolkit is widely used for its efficiency [45,46,47,48]. The speaker model generates embeddings, called X-vectors [49], with a fixed dimensionality of 128 capturing voice features.
SpeechBrain [50] is an open-source community toolkit for speech processing. It offers a wide variety of models tailored to different functionality requirements, such as Automatic Speech Recognition, emotion recognition or speech enhancement to remove noise from audio recordings. Most can run on-device without a network connection. It includes text-independent models for speaker verification, recognition, or diarisation. The selected SpeechBrain models for speaker verification, which are pre-trained on the audio-visual dataset VoxCeleb (Visual Geometry Group, University of Oxford. “VoxCeleb: Large-Scale Speaker Identification” dataset, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ (accessed on 13 February 2026)), are:
- Time Delay Neural Network (TDNN) (SpeechBrain “spkrec-xvect-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-xvect-voxceleb (accessed on 13 February 2026)) [51], which outputs embeddings called X-vectors [49] with 512 values that capture the voice biometrics. The model weighs 100–120 MB.
- Time Delay Neural Network with Attention (ECAPA-TDNN) (SpeechBrain “spkrec-ecapa-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb (accessed on 13 February 2026)) [52], with a size of 50–60 MB and vectors called ECAPA vectors [52] with 192 values for voice features.
- Residual Neural Network (ResNet) (SpeechBrain “spkrec-resnet-voxceleb” model card, Hugging Face, https://huggingface.co/speechbrain/spkrec-resnet-voxceleb (accessed on 13 February 2026)) [53], with a size of 70–120 MB and embeddings, called Residual vectors (r-vectors) [54], of 256 values.
Resemblyzer [55] from Resemble AI [56] is a Python library (Python version 3.5 or above required) that extracts the speaker’s voice biometrics using a pre-trained Deep Learning model. The output vector of the model is a variation of deep vectors (d-vectors) [57], and its dimension is 256. The model is text-independent, runs locally without an internet connection, and weighs 30–40 MB.

3.1.2. Clustering Algorithms

Once the embeddings are obtained, they must be processed for speaker identification. While several methodologies exist, we selected clustering algorithms due to their suitability for real-time and unsupervised operation without labels, as knowing the user’s name is unnecessary for the robot. Using this approach, embeddings are grouped into similarity-based clusters, ideally representing individual speakers, each with a number as a reference. Additionally, we selected incremental and dynamic algorithms, enabling online processing and the creation of new clusters for unseen speakers. The selected open source clustering methods are described below.

BIRCH, which stands for “Balanced Iterative Reducing and Clustering using Hierarchies”, is a clustering method designed for large, multi-dimensional datasets. It incrementally and dynamically constructs a clustering feature tree that summarises cluster information into a set of three values, reducing memory usage. These values are the number of data points, the linear sum, and the square sum. The tree is branched based on a threshold, a positive hyperparameter that defines the radius of each cluster [58]. In this work, we use the Scikit-learn (Scikit-learn. Scikit-learn: Machine Learning in Python, https://scikit-learn.org/stable/ (accessed on 13 February 2026)) BIRCH implementation.
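
The three-value summary described above can be sketched in a few lines. This is an illustrative reconstruction of a single clustering feature and not the Scikit-learn implementation; the class and method names are hypothetical. It shows why the summary supports incremental updates: adding a point only touches the count, the linear sum, and the square sum, and the centroid and radius can be recovered from those three values alone.

```python
import math


class ClusteringFeature:
    """Hypothetical summary of one cluster: count, linear sum, and square sum."""

    def __init__(self, dim):
        self.n = 0                      # number of data points
        self.linear_sum = [0.0] * dim   # per-dimension sum of the points
        self.square_sum = 0.0           # sum of squared norms of the points

    def add(self, point):
        """Absorb a new point without storing it."""
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root mean squared distance of the absorbed points to the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


cf = ClusteringFeature(dim=2)
for p in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf.add(p)
```

In BIRCH, a new sample is absorbed by a leaf entry only if the resulting radius stays within the threshold; otherwise a new entry is started, which is what allows new clusters to appear for unseen speakers.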

IncDBSCAN, or IncrementalDBSCAN, is based on the clustering algorithm DBSCAN (“Density-based spatial clustering of applications with noise”) [59], which can be used on any dataset containing metric-space data. This algorithm processes data incrementally, updating clusters as new points arrive without reprocessing the entire dataset. This approach assumes that adding a new point affects only its cluster and its local neighbours [60]. IncDBSCAN classifies samples as noise (labelled as $-1$) if they do not belong to any other clusters or if not enough samples are available to form a new cluster. The implementation used in this work, “Incdbscan” (IncDBSCAN project on PyPI, https://pypi.org/project/incdbscan/ (accessed on 13 February 2026)), has three main hyperparameters:

Neighbourhood radius ( $ε$ ): Defines the maximum distance between a point and a core point (the centre of a cluster) to be considered a neighbour. $ε$ must be above 0.
Minimum points (Mp): The number of points necessary to form a new cluster. It has to be greater than or equal to 1.
Metric: The distance function used to measure the similarity between points. The available functions are: “cosine”, “cityblock”, “euclidean”, “l1”, “l2”, “manhattan”, and “nan_euclidean”.
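
To make the roles of $ε$ and Mp concrete, the sketch below applies the standard DBSCAN point definitions to a small batch of points. It is a simplified, non-incremental illustration and does not reproduce the Incdbscan library; the function names are hypothetical, the Euclidean metric is an arbitrary choice, and the neighbourhood count includes the point itself, which is an assumption about the counting convention.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbours(index, points, eps, metric=euclidean):
    """Indices of all points within eps of points[index], itself included."""
    return [j for j, q in enumerate(points) if metric(points[index], q) <= eps]


def classify(points, eps, min_pts, metric=euclidean):
    """Label each point 'core', 'border', or 'noise' (DBSCAN definitions)."""
    core = {
        i for i in range(len(points))
        if len(neighbours(i, points, eps, metric)) >= min_pts
    }
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours(i, points, eps, metric)):
            labels.append("border")   # close to a core point, but not dense itself
        else:
            labels.append("noise")    # the case IncDBSCAN reports as -1
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.25, 0.0), (5.0, 5.0)]
labels = classify(points, eps=0.2, min_pts=3)
```

The isolated last point is labelled as noise because it has too few neighbours within $ε$; this mirrors the situation in which a speaker has produced too few samples to form a cluster.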

DenStream is based on DBSCAN [59] and was developed to discover clusters of different shapes, handle outliers, and avoid the need to define the number of clusters a priori. This algorithm is based on the idea of forming different types of micro-clusters: a core-micro-cluster, a stable group centre with a high density of points; a potential micro-cluster, an evolving micro-cluster not dense enough to be considered a core-micro-cluster; and an outlier micro-cluster, a low-density micro-cluster with potential noise that may later become dense. New points are assigned to the nearest micro-cluster if the distance is below a threshold. The algorithm can forget old points to adapt to new data, freeing memory [61]. DenStream, implemented via the Python library River (“Online machine learning in Python”, https://riverml.xyz/dev/ (accessed on 13 February 2026)), has the following hyperparameters:

Decaying factor ( $λ$ ): Controls how fast the algorithm forgets old points. $λ$ has to be greater than 0, but the smaller this value, the slower it forgets.
Weight threshold ( $β$ ): Defines the weight threshold to distinguish a potential micro-cluster from an outlier micro-cluster. $β$ must be within the range $(0, 1]$ .
Minimum weight ( $μ$ ): The minimum weight required to consider a micro-cluster dense enough to be a core micro-cluster. $μ$ has to be above $1/β$ , as $β \cdot μ$ must be greater than 1.
Micro-cluster radius ( $ε$ ): The maximum radius of a micro-cluster. $ε$ must be above 0.
Initial number of samples ( $n_{samples}$ ): The number of points used in the initial phase to create the initial micro-cluster. It has to be greater than or equal to 1.
Stream speed (v): The number of points processed per time unit. It must be greater than or equal to 1.
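
The constraints listed above can be checked before a configuration is used. The following sketch is a hypothetical helper, not part of River; its argument names are chosen for readability and need not match the library's. The `faded_weight` function assumes the exponential fading $2^{-λ t}$ described in the original DenStream paper [61]; whether the library applies it in exactly this form is not verified here.

```python
def denstream_constraint_errors(decaying_factor, beta, mu, epsilon,
                                n_samples_init, stream_speed):
    """Return the violated constraints (an empty list means the setting is valid)."""
    errors = []
    if decaying_factor <= 0:
        errors.append("decaying factor must be greater than 0")
    if not 0 < beta <= 1:
        errors.append("beta must lie in (0, 1]")
    if beta * mu <= 1:
        errors.append("mu must be above 1 / beta, so that beta * mu > 1")
    if epsilon <= 0:
        errors.append("epsilon must be above 0")
    if n_samples_init < 1:
        errors.append("initial number of samples must be at least 1")
    if stream_speed < 1:
        errors.append("stream speed must be at least 1")
    return errors


def faded_weight(weight, elapsed, decaying_factor):
    """Weight of a point after `elapsed` time units under 2^(-lambda * t) fading."""
    return weight * 2.0 ** (-decaying_factor * elapsed)


valid = denstream_constraint_errors(0.25, 0.5, 2.5, 0.1, 10, 1)
invalid = denstream_constraint_errors(0.25, 0.5, 2.0, 0.1, 10, 1)
half = faded_weight(1.0, elapsed=4, decaying_factor=0.25)
```

With $λ = 0.25$, a point loses half of its weight after four time units, which illustrates why smaller values of $λ$ make the algorithm forget more slowly.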

DBSTREAM is a density-based clustering algorithm for data streams based on DBSCAN [59], similar to DenStream in its use of micro-clusters. It maintains density-connected micro-clusters and uses a shared density graph to represent evolving cluster structures. It is dynamic in updating centres, allowing them to move towards areas of maximal local density, while other algorithms maintain a fixed grid (e.g., D-Stream) [62]. The adjustable hyperparameters, implemented via the Python library River, are:

Clustering threshold (r): Defines the maximum radius of micro-clusters. r has to be above 0.
Fading factor ( $λ$ ): Controls how fast old points lose importance. $λ$ must be greater than 0, and the smaller its value, the more slowly old points are forgotten.
Clean-up interval ( $t_{gap}$ ): Determines how frequently outliers or low-density micro-clusters are removed from memory. Must be above 0.
Intersection factor ( $α$ ): Threshold for merging two overlapping micro-clusters. Must be greater than 0.
Minimum weight ( $w_{\min}$ ): Minimum micro-cluster’s weight to be considered a stable micro-cluster. Should be greater than 1.

The previous tools are used to obtain voice biometrics and cluster numbers that represent each user interacting with the system. The following section describes the user recognition methodology we have developed for voice biometrics extraction and dynamic clustering, including the addition of unknown users who have not previously spoken to the robot.

3.2. Proposed Methodology

The methodology presented in this section works under predefined requirements that our user recognition system must meet to run on low-resource robots. These requirements are:

The system must run in real time. This means user identification must be provided immediately after the user speaks to the robot without delay.
It has to work without human intervention, meaning it has to run autonomously and unsupervised.
It must be incremental, implying the addition of new embeddings without processing the saved data again.
It has to be dynamic, allowing the addition of new users without manually pre-recording and preprocessing their information.
It must run locally without an internet connection or access to an external server.
The tools and algorithms employed must be as light as possible to run on computers with limited computational resources.

Figure 1 shows a flowchart of the proposed methodology, taking into account the previous requirements. The system starts by loading the internal database of saved embeddings from previous interactions if it exists. This database contains the users’ IDs and their corresponding embeddings. Users’ IDs are symbolic codes, not necessarily matching the user’s real name, as they are internal values not used to address the speaker. When the database has content, the embeddings are loaded into the clustering algorithm, and the relationships between the internal users’ IDs and the cluster numbers are also saved. This relationship is not saved in the database because the cluster numbers can change when the system is retrained. Then, the system waits for a user to speak. Once the user interacts with the robot and the microphone captures audio, the system calls the model that performs voice biometric extraction, obtaining an embedding for the current speaker.

Once the embedding is acquired, it is fed to the clustering algorithm, which returns the cluster number for the current speaker. This step can only be achieved if the selected algorithm is incremental. If this requirement is not met, the entire algorithm must be reloaded with all the data, which is not scalable because the clustering time increases with the number of embeddings. If the cluster number matches one of the numbers stored in the system’s initial state, it can obtain the user ID associated with the current speaker. However, if the sample belongs to a new cluster, the system stores the embedding in a provisional list and erases its data from the clustering algorithm working memory, if possible.

When the provisional list reaches k samples (k = 3 in our implementation, determined empirically: most dynamic clustering algorithms require two samples to create a new cluster, and we add one extra sample for robustness), the embeddings are reinserted into the algorithm. If the cluster number returned for all the embeddings in the list is the same, the system generates a new random ID for this speaker and stores its embeddings in the database. This step is performed to confirm the detection of a new speaker and to avoid occasional mismatches of a known user. This verification can only be achieved if the clustering algorithm is dynamic and allows the appearance of new clusters without processing the full dataset; if the algorithm is not dynamic, the system would always attempt to place the unknown speaker into one of the known clusters.

When the cluster numbers generated for the provisional list of embeddings differ, the system resets the process, erasing the provisional list. This situation occurs, for example, when two different people talk consecutively. Working this way, the system prevents an unwanted embedding from being saved and the creation of an incorrect new user. The provisional list is also erased after the step of detecting a known speaker, since the speaker has been identified. Additionally, the embeddings previously fed into the clustering algorithm are always erased if possible, except when the system saves a new user or loads the database, to save computational resources and prevent unnecessary memory usage.
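
The flow described in the preceding paragraphs can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used on the robot: `ToyClusterer` is a hypothetical stand-in (nearest centre within a fixed radius) for the dynamic clustering algorithms of Section 3.1.2, the embeddings are toy two-dimensional points, and all class and method names are invented for the example.

```python
import math
import uuid


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyClusterer:
    """Hypothetical dynamic clusterer: a sample joins the nearest centre
    within `radius`; otherwise it opens a new cluster."""

    def __init__(self, radius):
        self.radius = radius
        self.centres = {}       # cluster number -> centre
        self.next_label = 0

    def nearest(self, point):
        """Cluster number of the closest centre within `radius`, or None."""
        in_range = [(distance(point, c), label) for label, c in self.centres.items()
                    if distance(point, c) <= self.radius]
        return min(in_range)[1] if in_range else None

    def insert(self, point):
        """Assign `point` to a cluster, creating one if none is close enough."""
        label = self.nearest(point)
        if label is None:
            label = self.next_label
            self.next_label += 1
            self.centres[label] = tuple(point)
        return label

    def snapshot(self):
        return dict(self.centres), self.next_label

    def restore(self, state):
        self.centres, self.next_label = dict(state[0]), state[1]


class UserRecogniser:
    def __init__(self, clusterer, k=3):
        self.clusterer = clusterer
        self.k = k                   # samples required to confirm a new user
        self.cluster_to_user = {}    # cluster number -> internal user ID
        self.database = {}           # internal user ID -> stored embeddings
        self.provisional = []        # samples from a not-yet-confirmed speaker

    def load_user(self, user_id, embeddings):
        """Startup: feed a stored user's embeddings and remember their cluster."""
        labels = [self.clusterer.insert(e) for e in embeddings]
        self.database[user_id] = list(embeddings)
        self.cluster_to_user[max(set(labels), key=labels.count)] = user_id

    def process(self, embedding):
        """Return the user ID for this sample, or None while it is unconfirmed."""
        label = self.clusterer.nearest(embedding)
        if label in self.cluster_to_user:
            self.provisional.clear()          # known speaker: drop pending samples
            return self.cluster_to_user[label]
        self.provisional.append(embedding)    # kept outside the clusterer
        if len(self.provisional) < self.k:
            return None
        state = self.clusterer.snapshot()
        samples, self.provisional = self.provisional, []
        labels = [self.clusterer.insert(e) for e in samples]
        if len(set(labels)) == 1:             # all samples agree: a new user
            user_id = uuid.uuid4().hex[:8].upper()
            self.database[user_id] = samples
            self.cluster_to_user[labels[0]] = user_id
            return user_id
        self.clusterer.restore(state)         # mixed speakers: discard and reset
        return None


recogniser = UserRecogniser(ToyClusterer(radius=0.5), k=3)
recogniser.load_user("GIG9WPD9", [(0.0, 0.0), (0.1, 0.0)])
known = recogniser.process((0.05, 0.05))   # falls in the loaded user's cluster
first = recogniser.process((5.0, 5.0))     # unknown: held in the provisional list
second = recogniser.process((5.1, 5.0))
third = recogniser.process((5.0, 5.1))     # third agreeing sample creates a user
```

In this sketch the known speaker is resolved from a single sample, whereas the unknown speaker yields no result until the third consistent sample arrives; if the provisional samples fall into different clusters, the clusterer state is restored and no user is created.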

The additional parameter our methodology employs is the threshold k, which sets the number of samples required to create a new user. This threshold helps accumulate enough samples for the dynamic emergence of a new cluster in the algorithm. A low value of this parameter may prevent the creation of new clusters, as most dynamic clustering algorithms require at least 2 samples to form new clusters. For example, we tested the methodology response with $k = 1$ and found that no new users were created, as a single sample was insufficient. On the other hand, for k greater than or equal to 2, the system’s performance remained constant in terms of correct cluster creation. Although these values performed the same in offline tests, the higher the value of k, the more robust the system is against recognition failures in real-world scenarios. However, a high value delays the result for a new speaker, as several iterations must occur before a new user is created. Therefore, we set the value to 3 to give the new cluster some robustness without compromising response time when creating a new cluster.

The proposed methodology loops while the system is active, waiting for new voice activity. The system only loads the saved embeddings and obtains its clusters upon startup, avoiding potential delays that this process might cause, especially when adding new users dynamically.

Figure 2 shows an example of the database loading stage for three users. All embeddings from the dataset (red, blue, and green points) are loaded into the clustering algorithm simultaneously. Then, the algorithm partitions the embeddings into groups and assigns each embedding a cluster number. Next, the methodology determines, for each user in the dataset, the cluster containing the most of their embeddings. For example, the blue points form the first cluster. Finally, the relationship between the user ID and the cluster number is saved.
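
The mapping step of the loading stage amounts to a majority vote per user. A minimal sketch is given below; the function name and the toy labels are assumptions for illustration, with the cluster numbers standing for whatever the chosen clustering algorithm returns after the database has been loaded.

```python
from collections import Counter


def map_users_to_clusters(user_ids, cluster_labels):
    """For each user, pick the cluster number that holds most of their embeddings.

    `user_ids[i]` is the owner of the i-th stored embedding and
    `cluster_labels[i]` is the cluster number assigned to it.
    """
    per_user = {}
    for uid, label in zip(user_ids, cluster_labels):
        per_user.setdefault(uid, Counter())[label] += 1
    return {uid: counts.most_common(1)[0][0] for uid, counts in per_user.items()}


# Three users with three embeddings each; one red sample lands in another cluster.
user_ids = ["red"] * 3 + ["blue"] * 3 + ["green"] * 3
cluster_labels = [0, 0, 2, 1, 1, 1, 2, 2, 2]
mapping = map_users_to_clusters(user_ids, cluster_labels)
```

Using the majority cluster tolerates a few stored embeddings being grouped with another speaker, at the cost of not detecting the case in which two users share the same dominant cluster.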

Figure 3 shows an example of the methodology flowchart for a known user (purple boxes) and an unknown user (green boxes). A known speaker needs only one embedding to be recognised. In this example, the embedding (the black triangle) is fed into the clustering algorithm, which returns 2 as the cluster number. Following the example from Figure 2, the second cluster corresponds to the user ID GIG9WPD9, which is the final step of the methodology. However, for an unknown user, the methodology needs three embeddings (the black crosses). As the three samples are grouped in cluster 4, which did not exist when the dataset was loaded, the method detects a previously unknown user and creates a new random ID under which to save their embeddings.

4. Methodology Evaluation

This section outlines the procedure for evaluating our proposal offline, beginning with the creation of a dataset from volunteers’ voices. Next, the experimental procedure for selecting the best voice biometric extraction model and clustering algorithm is described, encompassing static clustering of the full dataset and the incremental incorporation of new samples.

4.1. Participants for Offline Evaluation

To generate our dataset, we recruited 40 volunteers, of whom 31 are men and 9 are women, with ages ranging from 21 to 81 years old (μ = 49.2, σ = 22.08). All participants gave their explicit consent in accordance with the Data Protection Policies accepted by the University Carlos III of Madrid Committee. Half of them were recruited through a robotics workshop in the University Carlos III of Madrid Senior program, while the others are students and professors from the University. This study is exempt from Ethics Approval since it does not involve clinical studies.

The volunteers who participated in the dataset collection introduced some imbalance in the age and gender distributions. Younger participants in the 18–24 group accounted for 6 volunteers (15%), and in the 25–34 group, there were 11 (27.5%). Adults in the 35–44 group accounted for 3 participants (7.5%), and in the 45–54 and 55–64 groups, there was only 1 volunteer from each (2.5%). Older adults in the 65–74 group accounted for 13 (32.5%), and in the 75–84 group, there were 5 (12.5%). Therefore, most volunteers belong to young or older adults.

Volunteers were informed that their voices would be recorded and assured that their audio recordings would be anonymised and not published; only the results obtained from them would be. The consent form includes a description of the experiment, a statement about data policies, and questions to collect some personal information, such as their name, age, and email address for contact.

4.2. Procedure for Offline Evaluation

The selection of the best models and algorithms for our application involved collecting the participants’ audio samples to extract their voice biometrics. First, the participants filled out the consent form, and then they were asked to read 12 Spanish phrases each while being recorded, yielding a total of 480 samples from the 40 volunteers. A technician manually started and stopped the recording. Half of the phrases were the same for all participants (see Appendix A for the translated phrases), while each participant selected the other six from a collection of Spanish books made available to users or from a chosen topic searched on the internet. The audio recording was performed under the same conditions for all participants, namely with the same microphone and electronics used in our pet-like robot, which was placed on a test bench. The microphone was hidden inside the stuffed fabric that covers the robot. The recordings were made under normal noise conditions in a research laboratory to be more realistic, and the default microphone noise cancellation system was used without any additional processing. The noise level in the laboratory was 45.63 dB, which is considered standard background noise as it is below 50 dB [63]. Each participant received a unique personal ID to anonymise the results.

After collecting all voice samples, the audio files were processed by each voice biometric extraction model to obtain their embeddings, which were saved in separate files for each model along with the users’ IDs. Vosk samples underwent two changes from the stated implementation due to the model’s text-dependent nature. First, artificial silence was added after each audio file to improve recognition, since Vosk provides a result only after a few seconds of silence are detected. Second, we discarded sentences with fewer than 10 words, as they did not yield consistent embeddings because they lacked sufficient information about the speaker.
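The two Vosk-specific preprocessing steps can be sketched as below. This is only an illustration operating on plain Python lists: the function names, the 16 kHz default sample rate, and the 2 s default silence duration are assumptions, since the text only states that "a few seconds" of silence are needed. The 10-word threshold is the one reported above.

```python
def pad_with_silence(samples, sample_rate=16000, seconds=2.0):
    """Append digital silence (zero-valued samples) to a list of PCM samples."""
    return list(samples) + [0] * int(sample_rate * seconds)


def has_enough_words(transcript, min_words=10):
    """Keep only sentences with at least `min_words` words."""
    return len(transcript.split()) >= min_words


# Toy example with a tiny sample rate so the numbers are easy to follow.
audio = [120, -340, 560]
padded = pad_with_silence(audio, sample_rate=4, seconds=1.5)
print(len(padded))  # 3 original samples + 6 silent samples

print(has_enough_words("a short sentence"))
```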

The next step was to determine the optimal hyperparameter values for each clustering algorithm combined with the different embedding batches from each model. We employed the grid search method, a machine learning strategy that exhaustively searches a selected subset of values to optimise an algorithm’s hyperparameters by making combinations from a list of possibilities [64]. We decided to employ this method as hyperparameter optimisation is an NP-hard (Non-deterministic Polynomial-time hard) problem, due to the high dimensionality of the parameter space and the complexity of the algorithms. Although the grid search test values represent a robust local optimum found via systematic search, a comprehensive stability analysis remains open for future work. For each combination, we provided the algorithm with the entire batch of model embeddings and obtained the performance metrics described in Section 4.3. We repeated this procedure across all models and algorithms. The combination of hyperparameters with the best score on the metrics was saved for later comparison of the results. Table 2 shows the range of values used for each hyperparameter of each clustering algorithm.
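The exhaustive search described above can be sketched in a few lines. The sketch is generic: `evaluate` stands in for "cluster the embeddings with these hyperparameters and return a score", and the hyperparameter names, grid values, and toy scoring function are hypothetical placeholders rather than the ranges from Table 2.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in `param_grid`; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and scoring function, for illustration only.
param_grid = {"threshold": [0.3, 0.5, 0.7], "branching_factor": [25, 50, 100]}


def toy_score(params):
    # Peaks at threshold=0.5 and branching_factor=50.
    return -abs(params["threshold"] - 0.5) - abs(params["branching_factor"] - 50) / 100


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The cost grows with the product of the list lengths (here 3 × 3 = 9 evaluations), which is why the searched ranges have to be chosen with care.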

The grid search test was performed for each algorithm on each batch of embeddings generated by each model, obtaining 20 combinations (4 algorithms × 5 embedding models). The metric scores obtained in this step were primarily used to compare the impact of the hyperparameters on the algorithm. After determining the best hyperparameters for each model embedding and clustering algorithm combination, we performed a second test to identify the combination that produces the best clustering when a new sample is obtained. For this test, we applied leave-one-out cross-validation to the top 6 combinations from the grid search test. This method splits the collected data into two sets, one for training and the other for validation, leaving out only one sample for validation [65]. On each run, a different sample is removed from the training set, ensuring that each sample is evaluated against the trained data. After running the algorithm with each embedding as the validation sample, we computed the average metrics for all runs.
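The leave-one-out procedure can be sketched as follows. To keep the example self-contained, a nearest-centroid rule replaces the clustering algorithms evaluated in the paper, so this shows only the validation loop, not the authors' pipeline. The function names and the two-dimensional toy embeddings are assumptions.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_out_success_rate(embeddings, user_ids):
    """Hold out each sample once and check it is assigned to its own user."""
    hits = 0
    for i, (held_out, true_user) in enumerate(zip(embeddings, user_ids)):
        groups = {}
        for j, (vector, user) in enumerate(zip(embeddings, user_ids)):
            if j != i:  # everything except the held-out sample is "training"
                groups.setdefault(user, []).append(vector)
        centroids = {user: centroid(vectors) for user, vectors in groups.items()}
        predicted = min(
            centroids, key=lambda user: squared_distance(held_out, centroids[user])
        )
        hits += predicted == true_user
    return hits / len(embeddings)


# Toy, well-separated 2-D "embeddings" for two hypothetical users.
embeddings = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
user_ids = ["A", "A", "A", "B", "B", "B"]
rate = leave_one_out_success_rate(embeddings, user_ids)
print(rate)
```

Each of the N samples requires one full pass over the remaining N − 1, which is why the validation total time is reported separately from the per-run inference time.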

4.3. Metrics for Offline Evaluation

The evaluation of the proposed voice recognition system is conducted through a multi-dimensional analysis that balances clustering accuracy with computational feasibility. Since the system must operate autonomously on resource-constrained robotic platforms, it is essential to quantify not only how well the algorithm partitions different speakers but also its ability to maintain low latency during real-time interaction. To this end, we propose a combination of information-theoretic metrics and pair-counting algorithms to assess the internal consistency of the clusters with respect to the ground-truth labels. Figure 4 shows the time metrics measured through all the tests in the offline evaluation.

First, we propose metrics to compute the performance of the embedding extraction models:

Model inference time: It is the time the voice biometric extraction model takes from the moment an audio sample is provided to the model until the model returns its associated embedding, as shown in Figure 4a.
Model total time: It is the time the model spends from the moment the first audio sample is processed, returning its embedding, until the last audio sample is processed, going through the entire database, as shown in Figure 4b.
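As an illustration of how these two timings could be collected, the sketch below wraps an arbitrary extraction function. `extract_embedding` is a hypothetical callable standing in for any of the compared models, and the dummy extractor exists only to make the example runnable.

```python
import time


def time_extraction(extract_embedding, audio_samples):
    """Return (mean inference time per sample, total time), both in seconds."""
    per_sample = []
    start_total = time.perf_counter()
    for audio in audio_samples:
        start = time.perf_counter()
        extract_embedding(audio)  # the embedding itself is not needed here
        per_sample.append(time.perf_counter() - start)
    total = time.perf_counter() - start_total
    return sum(per_sample) / len(per_sample), total


def dummy_extractor(audio):
    """Stand-in for a real model: returns a one-dimensional 'embedding'."""
    return [sum(audio)]


mean_time, total_time = time_extraction(dummy_extractor, [[1, 2, 3], [4, 5, 6]])
print(f"mean={mean_time:.6f} s, total={total_time:.6f} s")
```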

The following metrics are employed for hyperparameter optimisation and the cross-validation test:

External clustering metrics:
- Adjusted Rand Index (ARI): ARI computes a similarity measure between the predicted clustering and the ground-truth labels based on pair counting, adjusting for chance. Its values range from $-1$ (less agreement than expected by chance) to 1 (perfect match), with 0 corresponding to random labelling [66,67].
- Adjusted Mutual Information (AMI): AMI compares the predicted clustering with the ground-truth labels using information theory (mutual information), adjusting for chance. Its values are between 0 (random agreement) and 1 (perfect agreement) [68].
- V-Measure: This metric is the harmonic mean of homogeneity and completeness. Homogeneity measures whether each cluster contains only objects from a single class, while completeness measures whether all objects of a given class are assigned to the same cluster. The V-Measure ranges from 0 to 1, where 1 stands for perfectly complete labelling [69].
Internal clustering metrics:
- Silhouette Coefficient: This coefficient provides a graphical display of the clusters’ silhouettes, showing how well the objects are classified within each cluster. The Silhouette Coefficient goes from $-1$ to 1, where 1 is the best value and 0 means overlapping clusters [70].
Custom evaluation metrics:
- Clusters: The number of clusters created by the clustering algorithm to group all the embeddings.
- Clustering latency: The amount of time the clustering algorithm takes to process all the embeddings and provide a result, as shown in Figure 4b.
- Success rate: The percentage of samples that have been correctly clustered during the leave-one-out cross-validation tests.
- Validation inference time: It is the time the clustering algorithm spends during the leave-one-out cross-validation test to cluster all embeddings for each training-validation dataset, as shown in Figure 4c.
- Validation total time: It is the time the leave-one-out cross-validation test takes, as shown in Figure 4c.
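To make the pair-counting idea behind ARI concrete, the following sketch computes it from the contingency counts of two labellings using the standard adjusted-for-chance formula. It is written for illustration with the standard library only and is not the implementation used in the evaluation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labellings of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs placed together in both labellings, in the truth, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labellings are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


# Cluster numbers are arbitrary: a relabelled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Merging two true speakers into one predicted cluster lowers the score (4/7 here).
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

The first call shows why ARI suits this setting: the clustering algorithm assigns arbitrary cluster numbers, and only the grouping of samples matters.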

External clustering metrics, such as ARI, AMI, and V-Measure, allow comparing the predicted labels against the ground-truth labels, thereby providing a measure of the similarity between the cluster prediction and the real group of embeddings per speaker. On the other hand, internal clustering metrics, such as the Silhouette Coefficient, evaluate clustering quality only from the predicted data, without requiring ground-truth labels. These metrics let us verify whether the clustering algorithms produce coherent results for the dataset used, which is important because we are not formally training the algorithms and cannot control their output or provide feedback to teach them their mistakes.

4.4. Results of Offline Evaluation

This section presents the results of the search for the best setup for our application. First, we show the performance of voice biometric embedding extraction per model. Then, we optimise the hyperparameters of the embedding–clustering combinations. Finally, we perform cross-validation to determine which combination yields the best clustering success rate for new samples from known speakers.

4.4.1. Embedding Extraction Time Performance

The voice biometric extraction models produced one embedding per audio file, yielding a total of 480 embeddings per model. Vosk was the exception: it generated 506 vectors, providing more than one embedding for some long audio files when it detected pauses in the user’s speech, and no result for others when the speech was not correctly recognised.

Table 3 shows the average time the model needs to generate an embedding from each audio file in the database (inference time), the time spent by each model to process the entire database (total time), and the number of embeddings generated. In terms of speed to produce the embeddings, SpeechBrain TDNN is the fastest model with 0.094 s, followed by Resemblyzer with 0.145 s. Models such as Vosk (0.645 s) and SpeechBrain ECAPA-TDNN (3.008 s) fall within the moderate response-time range. Finally, SpeechBrain ResNet requires 28.448 s on average to generate an embedding from each audio file in the dataset. Following a similar tendency, SpeechBrain TDNN is the fastest to process the entire dataset, followed by Resemblyzer (69.790 s) and Vosk (327.280 s). SpeechBrain ECAPA-TDNN (1479.945 s) and ResNet (13,654.823 s) are very slow compared to the others.

4.4.2. Hyperparameter Optimisation for Each Embedding–Clustering Combination

Table 4 shows the hyperparameters that produced the best results for each voice biometric extraction model and clustering algorithm combination using the grid search method. DBSTREAM hyperparameters are the same regardless of the voice biometrics extraction model. The explanation for this outcome is that DBSTREAM always produces 2 clusters across all hyperparameter combinations, failing to optimise its performance.

Table 5 shows the ranking of the best voice embedding–clustering combinations (adjusted with the hyperparameters from Table 4). The ranking was determined after evaluating all metrics in accordance with the objective of our application. First, we consider the number of clusters the most important metric, as it estimates the number of speakers in the dataset and the aim of the work is to correctly detect speakers. Second, we consider ARI, as it measures the similarity between true and predicted clusters, indicating how well the clustering prediction matches the ground truth. Third, we consider the Silhouette Coefficient, as it provides information on the quality of the clusters without the ground-truth labels. Configurations were ranked by prioritising these criteria sequentially, with each metric breaking ties from the previous one.
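The sequential tie-breaking described above amounts to a lexicographic sort. The sketch below shows one way to express it. Measuring the first criterion as the absolute difference from the 40 speakers is an assumption, and the three entries are invented placeholders, not values from Table 5.

```python
N_SPEAKERS = 40  # participants in the offline dataset


def rank_combinations(results, n_speakers=N_SPEAKERS):
    """Sort by cluster-count error, then by ARI, then by Silhouette Coefficient."""
    return sorted(
        results,
        key=lambda r: (
            abs(r["clusters"] - n_speakers),  # smaller error first
            -r["ari"],                        # higher ARI first
            -r["silhouette"],                 # higher Silhouette first
        ),
    )


# Placeholder entries for illustration only.
results = [
    {"name": "combo A", "clusters": 38, "ari": 0.90, "silhouette": 0.40},
    {"name": "combo B", "clusters": 40, "ari": 0.70, "silhouette": 0.30},
    {"name": "combo C", "clusters": 40, "ari": 0.70, "silhouette": 0.35},
]

ranking = [r["name"] for r in rank_combinations(results)]
print(ranking)
```

Here "combo A" ranks last despite its higher ARI because the cluster count takes precedence, and "combo C" beats "combo B" only through the final tie-breaker.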

The combination with the best scores was BIRCH with Vosk, which generated exactly as many clusters as there were participants. This combination has the best scores in AMI (0.936) and V-Measure (0.957). However, for ARI and the Silhouette Coefficient, the best match is IncDBSCAN with Resemblyzer’s embeddings, with values of 0.824 and 0.475, respectively. In terms of time, BIRCH is the fastest algorithm to process data, especially with the SpeechBrain TDNN model (0.055 s).

4.4.3. Evaluation of New Samples Detection

The cross-validation test was performed using the six combinations that yielded the best scores in the grid search test. The selected voice biometric extraction models are Vosk, SpeechBrain ResNet, and Resemblyzer, while the clustering algorithms are BIRCH and IncDBSCAN. Table 6 shows the average scores obtained for each combination, sorted by success rate. This metric measures how well the clustering method classifies new embeddings into the clusters of known users, which implies correctly recognising the user. The combination with the highest success rate is IncDBSCAN with Vosk’s embeddings, 91.3%, followed by BIRCH and Vosk with almost an 85% success rate. In terms of speed, BIRCH is faster than IncDBSCAN, as already observed in the previous tests, spending 0.215 s less time to generate the output for each train-validation dataset. Vosk is the voice biometric extraction model with the best success rate, combined with both BIRCH and IncDBSCAN clustering algorithms.

5. Integration in a Pet-like Social Robot

This section describes our pet-like robot, the platform on which we have integrated our user recognition system. We propose a real-world scenario to preliminarily validate our method’s performance when both known and unknown users interact with the robot and present the results of this evaluation.

5.1. Our Pet-like Social Robot

This work has been integrated into a pet-like social robot [71] for older adults’ companionship and affective support, shown in Figure 5. Although the methodology described in this manuscript was designed to enhance our robot’s skills, it can be applied to robots equipped with a microphone, a processing unit running the Robot Operating System (ROS), a voice biometrics extractor model, and a library to perform the preferred clustering algorithm.

The platform is a small, friendly, and fluffy rabbit-like robot for human–robot interaction research. More specifically, it was conceived to study whether pet-like robots can fulfil the role of real animals in providing animal-assisted therapy, but without the limitations of dirt and sanitation. The robot has numerous sensors to perceive the environment, including a stereo microphone with integrated noise cancellation, four touch sensors in the ears, forehead, and back, a radiofrequency (RFID) reader used to take care of and play with the robot using RFID cards that simulate food or toys, different temperature sensors that are used to avoid overheating, and an inertial measurement unit to sense movement changes. Regarding its actuation capabilities, the robot has five motors to move the ears, nose, back (to simulate breathing), and tail, and a haptic motor in the belly for vibration feedback. The robot has two coloured lights in the cheeks that simulate blushing and different emotional states, and a stereo speaker to play non-verbal natural sounds. The robot includes a 7000 mAh capacity battery system, making it portable and allowing it to work for more than 4 h. Its processing unit is a Raspberry Pi 5 (Raspberry Pi Ltd., Cambridge, UK) with Debian 12 operating system and ROS 2 software to implement a modular software architecture. This architecture, represented in Figure 6, consists of a Perception Manager, a Human–Robot Interaction Manager, a Decision-Making System connected to some gaming skills, and an Expression Manager.

The Perception Manager receives raw sensor data and generates a unified message containing the sensors’ information in a standard format that the other modules can understand. The Human–Robot Interaction Manager controls the robot’s communicative acts, resolving conflicts when more than one expression is required simultaneously (e.g., reacting to a caress and making a sound for a game). The Decision-Making System selects the most appropriate behaviour based on the robot’s context, derived from its sensors. These behaviours have been designed to respond to physical stimuli (e.g., user touches or RFID cards simulating food or toys) and user instructions (e.g., requests to begin a game). Finally, the Expression Manager receives actuation orders from the Human–Robot Interaction Manager and, using specific managers called players, controls the execution of expressions stored in a database.

The user recognition system proposed in this manuscript is integrated into the Voice translator in the Perception Manager. It receives audio from the microphone and converts the user’s speech into text using a speech-to-text tool. In addition, the Voice translator performs speaker recognition using voice biometrics extraction and clustering. Finally, the Perception Manager sends the recognised user to the Decision-Making System for further adaptation of the robot’s behaviour.

5.2. Participants for Online Evaluation

The participants in this online evaluation comprised 10 people: 5 had already participated in the offline evaluation, serving as known users, and 5 were new participants who had not spoken to the robot, serving as unknown users. This set of participants consisted of 7 men and 3 women, aged 21 to 39 years (μ = 28.60, σ = 6.64). The newly recruited people also voluntarily gave their explicit consent in accordance with the Data Protection Policies accepted by the University Carlos III of Madrid Committee by signing the same consent form as in the previous evaluation. They are students and professors from our university. Once again, this evaluation is exempt from Ethics Approval since it does not involve clinical studies. The participants were notified that only their voice biometrics, but not their audio files, would be recorded, as described in the methodology section.

5.3. Procedure for Online Evaluation

The offline evaluation helped determine the best combination of a voice biometric extraction model and a clustering algorithm for a set of known users. However, its results do not show the methodology’s performance when encountering unknown users and when using online information. Therefore, we integrated the user recognition methodology with the best combination into the robot and tested its performance in a real-world scenario.

The scenario consisted of a single user reading to the robot for 3–4 min. This procedure was repeated under the same conditions for the 10 participants. For these scenarios, we loaded the voice embeddings obtained from the previous evaluation into the robot’s database and relaunched the system for each participant. Therefore, the 5 participants whose voice biometrics were in the database were considered known users, while the other 5, who were new participants, were considered unknown users. During the scenarios, we recorded the number of interactions and whether the method correctly recognised known speakers and created new clusters for unknown ones.

5.4. Metrics for Online Evaluation

The performance of the system for the online evaluation was assessed using the following metrics:

Number of interactions: The number of times the system extracts an embedding from the audio captured and provides a cluster result when participants talk to the robot.
Correct detections: It is the number of times the clustering algorithm returns the correct cluster given a new embedding. For known users, it refers to the cluster associated with that user. For unknown users, we consider a correct detection when the clustering algorithm creates a new cluster for the user and assigns the new embedding from this user to it. It is important to note that the system sometimes assigns different people to the same cluster when loading the robot’s database for the first time because it cannot distinguish between them. In this situation, we consider a correct detection if the cluster returned by the system for a new embedding matches the initial prediction. We also report the correct detections as a percentage.
Incorrect detections: We consider an incorrect detection when the clustering algorithm returns the cluster of a different user than the one speaking. The cluster must be from an existing user, excluding noise detection. In addition, we report the incorrect detections as a percentage.
Noise detection: This is the number of times the clustering algorithm classifies an embedding as noise, not providing a valid cluster number. We exclude from this counter the noise detections that are later treated as a new user’s cluster, as they are necessary to generate a profile for an unknown speaker. Additionally, the noise rate is reported as the percentage of noise detections.
New cluster detections: It is the number of times a new cluster is detected across the interactions. For known speakers, the cluster must be different from the one associated with their IDs. For unknown speakers, we do not count the first cluster created for them. Additionally, we report the new cluster detection as a percentage.
New clusters created: It refers to the number of new clusters created across the interactions for a given user. The appearance of a new cluster when an unknown speaker is talking to the robot is considered a good sign of recognising a new user, and its detection is considered a correct detection. Meanwhile, for a known speaker, it does not count as a correct or incorrect detection.
Time to cluster database: Time the clustering algorithm spends loading and obtaining the cluster for the database of known users, as shown in Figure 7.
Embedding-to-cluster latency: It is the time from the moment the model produces an embedding until a cluster number is obtained, as shown in Figure 7.

5.5. Results of Online Evaluation

The results of the offline tests presented in Section 4.4 revealed that the Vosk voice biometrics extraction model and the IncDBSCAN clustering algorithm are the best combination for our application and dataset. Therefore, these techniques were integrated into our pet-like robot to test their user recognition capabilities, namely, correctly identifying known speakers and creating new user profiles when encountering unknown speakers. Table 7 presents the user recognition results for known and unknown speakers obtained using the methodology implemented on the robot. For known users, we had a total of 116 interactions with an average of 23.20 interactions per user, of which 85 were successful (73.28%) as the user was recognised correctly, and 1 was incorrect (0.86%) as a different user from the database than the one speaking was detected. In addition, 12.07% of the interactions were classified as noise (14 interactions), resulting in no user recognition. User Known 1 was correctly detected twice at the start of the test. However, the third time the user interacted with the robot, the algorithm considered that the voice biometrics had changed and created a new cluster. The remaining interactions (16) stayed consistent in detecting this new user. Additionally, during this user’s interactions, he was mistaken for another user and was discarded as noise twice.

For the unknown users, we had a total of 107 interactions, with an average of 21.40 interactions per user, of which 76 were successful (71.03%), and none were incorrect. The noise rate for the unknown users is 22.43% (24 interactions). These interactions created 6 new users when they should have created only 5, since that is the actual number of new speakers. This anomaly was produced by interactions with user Unknown 5, for which the clustering algorithm assigned two clusters: one with 10 detections and the other with 7. In total, across all 223 interactions, we obtained a success rate of 72.20%, a failure rate of 0.45%, and a noise rate of 17.04%.
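The reported percentages follow directly from the interaction counts given in this section, as the short check below shows. The dictionary layout and function name are illustrative; the counts are the ones stated in the text.

```python
# Interaction counts as reported for the online evaluation.
known = {"interactions": 116, "correct": 85, "incorrect": 1, "noise": 14}
unknown = {"interactions": 107, "correct": 76, "incorrect": 0, "noise": 24}


def rates(counts):
    """Percentages of correct, incorrect, and noise detections (2 decimals)."""
    n = counts["interactions"]
    return {k: round(100 * counts[k] / n, 2) for k in ("correct", "incorrect", "noise")}


def remainder(counts):
    """Interactions that are neither correct, incorrect, nor noise."""
    return counts["interactions"] - counts["correct"] - counts["incorrect"] - counts["noise"]


total = {k: known[k] + unknown[k] for k in known}

print(rates(known))    # per-group rates for known users
print(rates(unknown))  # per-group rates for unknown users
print(rates(total))    # overall rates across all 223 interactions
print(remainder(known), remainder(unknown))
```

The remainders (16 and 7) coincide with the detections assigned to the extra cluster of user Known 1 and to the second cluster of user Unknown 5, which explains why the three rates do not add up to 100%.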

This methodology, averaged over all interactions, consumed 4.120% of the Central Processing Unit (CPU) of the Raspberry Pi 5 and used 398.998 MB of RAM. The mean time to cluster the database was 0.531 s, and the mean embedding-to-cluster latency was 5.376 ms. These metrics show the methodology’s performance in real-time scenarios, demonstrating its ability to respond in less than a second and without requiring high computational resources.

6. Discussion

This work compares different voice biometric extraction models and clustering algorithms to demonstrate how the best combination, when applied with our methodology, performs in a real scenario. The voice biometric model selection compared the models based on inference time and the total time each model spends obtaining embeddings from an audio sample and from the entire dataset. In this task, SpeechBrain ECAPA-TDNN and ResNet models were the slowest, taking 3.01 s and 28.45 s, respectively, indicating that they struggle to operate in real time. The remaining models had valid times for real-time interactions, as they take less than a second to obtain each embedding. Another aspect to consider is the dimensionality of the embeddings produced by each model, given the memory consumption required to store and process high-dimensional vectors. Vosk has the lowest dimensionality, with embeddings of length 128, followed by the SpeechBrain ECAPA-TDNN model with 192 values. Meanwhile, the SpeechBrain ResNet model and Resemblyzer have embeddings of length 256, and the SpeechBrain TDNN model has embeddings of length 512. Therefore, in terms of speed and memory usage, the model with the best overall balance between performance and resource efficiency is Vosk.

The grid search test helped discard the worst-performing embedding extraction model and clustering algorithm combinations. The algorithms DenStream and DBSTREAM were primarily rejected due to poor performance, especially regarding the number of clusters created. A clustering algorithm that cannot even create an approximation of the ground-truth number of clusters is not valid for our methodology. In addition, DenStream was the slowest, and DBSTREAM had unsatisfactory scores on the measured metrics. On the other hand, BIRCH and IncDBSCAN both produced high scores and required little time to process all the embeddings. Additionally, the two algorithms produced $40.8 \pm 4.27$ and $39.6 \pm 4.72$ clusters across all combinations, which are close to the 40 participants in the offline evaluation. BIRCH and IncDBSCAN account for the top 10 combinations and are viable options for our methodology. As for the voice biometric extraction models, the top 6 combinations are covered by Vosk, Resemblyzer, and SpeechBrain ResNet, which rules out the SpeechBrain TDNN and ECAPA-TDNN models.

The leave-one-out cross-validation test is the final step in selecting the model and algorithm integrated into our methodology, focusing on the success rate of each combination while still considering the other metrics. The top voice biometric extraction model is again Vosk, achieving both first and second places with success rates of 91.30% combined with IncDBSCAN and 84.98% using BIRCH. The success rates for Resemblyzer and the SpeechBrain ResNet model differed depending on the clustering algorithm used, obtaining the best results with IncDBSCAN (81.04% and 72.50%, respectively). As for the clustering algorithm, IncDBSCAN had the best success rates, as mentioned before. Although IncDBSCAN is slightly slower than BIRCH, it is fast enough to run in real-time scenarios. Therefore, the selected combination is the clustering algorithm IncDBSCAN and the voice biometric extraction model Vosk.

The offline evaluation aims to select the best embedding–clustering combination based on accuracy and speed. The evaluation decisions prioritised better performance over a faster response. Bo et al. [72] suggest that higher accuracy is more important than faster speed, as it has a greater impact on user perception and improves their experience. Additionally, the literature [73,74,75] supports the importance of good performance, showing that it affects users’ trust in human–robot interactions. As the system achieved good accuracy (91.30%) and latency (0.281 s) with the selected combination, we conducted an online evaluation to test the methodology for unknown speakers and to measure the computational cost in a real-time scenario, which is essential for resource-constrained robots.

The chosen combination was integrated into our software architecture and implemented in the robot. Then, we tested our methodology in a scenario where the robot encountered 5 known speakers to verify correct user recognition and 5 unknown speakers to check new user creation in the system and subsequent recognition. For the known users’ interactions, the success rate was 73.28%, and was affected by numerous noise detections (12.07%). However, in our application, noise detection is not a problem as long as the user is frequently detected and there are no incorrect detections. The reason for this is that the first time the speaker is recognised, its ID will be loaded as the current user, and this information will not change until a different speaker is recognised or the robot is shut down. For the unknown users, we did not find significant differences compared to the outcomes for known users, since the success rate was quite similar, 71.03%. However, for unknown users, the noise rate increased to 22.43%. Once again, we consider this noise rate as acceptable for our application since the user is consistently recognised and there are no recognition errors that classify a user into the wrong cluster. The failure rate across all interactions is 0.45%, which is low given the high number of interactions (223).
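The reason noise detections are tolerable can be illustrated with a small sketch of the "current user" behaviour described above: the stored ID changes only when a speaker is recognised, and noise leaves it untouched. Representing noise as `None` and the function name are assumptions made for the example.

```python
def track_current_user(detections):
    """Return the current-user ID after each detection; None marks noise."""
    current = None
    history = []
    for user_id in detections:
        if user_id is not None:  # noise does not overwrite the current user
            current = user_id
        history.append(current)
    return history


# Hypothetical detection stream: noise, user A, two noise results, A, then B.
stream = [None, "A", None, None, "A", "B", None]
print(track_current_user(stream))
```

Under this behaviour, an incorrect detection is more harmful than a noise detection, since it would replace the current user with the wrong ID until the next successful recognition.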

The results reported interesting situations for users Known 1 and Unknown 5, as shown in Table 7. In both cases, our methodology generated new clusters during their interactions. For the user Known 1, although he had been previously recognised, the system created a new cluster and classified 16 embeddings in it. This case indicates that for 16 interactions, the user’s voice biometrics were identified as new, but all matched the same clustering group. Similarly, for the user Unknown 5, the system created two new clusters. Initially, the algorithm assigned 10 embeddings to the first cluster and, later, assigned 7 embeddings to the second new cluster. This case implies that, for 10 interactions, the user was defined as new and a new profile was created. Then, for the following 7 interactions, he was again recognised as a new user and a second profile was generated. Occasionally, some noise was detected between interactions. This situation has a negative consequence: when the new cluster is created, the user’s ID also changes, resulting in at least two IDs for the same speaker. As the detections remain constant, it seems unlikely that the user can be recognised with the previous ID, so the information associated with it may not be accessed again by the system. Nevertheless, this situation occurred for only two of the ten users, and it is not an unacceptable mistake for our application. A possible solution could be to forget old IDs that have not been detected for some time.
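The suggested remedy of forgetting IDs that have not been detected for some time could look like the sketch below. It is a hypothetical illustration of the idea, not something evaluated in this work: the function name, the use of timestamps in seconds, and the retention period are all assumptions, and a full solution would also need to remove the embeddings stored under the forgotten IDs.

```python
def prune_stale_users(last_seen, now, max_age_s):
    """Return a copy of `last_seen` without users unseen for over `max_age_s` seconds."""
    return {
        user_id: timestamp
        for user_id, timestamp in last_seen.items()
        if now - timestamp <= max_age_s
    }


# Hypothetical timestamps (seconds): the first profile of a fragmented user
# was last detected long before the second one.
last_seen = {"OLD_ID": 100.0, "NEW_ID": 950.0}
kept = prune_stale_users(last_seen, now=1000.0, max_age_s=600.0)
print(kept)
```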

Our application aims to recognise the user in order to learn their needs and preferences and to adapt the robot’s behaviour in future work. Our pet-like robot does not talk; it only expresses its intentions through nonverbal sounds and gestures, implying that the user is never referred to by name. Therefore, even if the recognition system detects excessive noise, which may result in occasional non-recognition of the user, or cluster fragmentation occurs, the user would not notice this in the early stages of interaction, as the robot can continue responding with a default configuration. However, this problem may become more important if the user spends enough time interacting with the robot to understand its adaptive behaviour.

Both evaluations had a set of volunteers who provided their voices. The age and gender distribution of the participants resulted in some imbalance that may affect the reliability of the evaluations. To determine whether this situation affects the methodology’s performance, we analysed the dataset used in the offline evaluation, with embeddings extracted using Vosk, employing t-Distributed Stochastic Neighbour Embedding (t-SNE), as shown in Figure 8. Figure 8a shows the gender distribution of the dataset, which presents some differences between males and females, but some points are mixed. Figure 8b shows the dataset’s age distribution, which does not present a clear difference between age groups. Therefore, the dataset does not seem to be affected by age or gender. Furthermore, because we employ dynamic clustering algorithms that are not formally trained, the methodology cannot exhibit discrimination against men or older adults.

Additionally, both evaluations required volunteers to read rather than to speak spontaneously, raising the question of how the system would perform with spontaneous speech. We based the online evaluation on read speech again to maintain replicability and coherence across evaluations. Although spontaneous and read speech may differ even within the same speaker, we expect the outcome to be unaffected as long as the speaking style is consistent across interactions. Furthermore, in a real-world setting, the dataset would start empty and grow dynamically from spontaneous interactions. However, future work should confirm this expectation.

To sum up, our methodology uses the embeddings extracted with the Vosk model and classifies them with the IncDBSCAN clustering algorithm, owing to the good performance and success rate (91.30%) of this combination in offline tests. In real-world scenarios using the pet-like robot, the correct detection rate drops to 72.20%. However, the methodology still performs well for our application, as most of the remaining detections are noise that does not yield a recognition result.

Limitations

The realisation of this approach reveals technical limitations derived from the robot’s hardware and the development of the proposed methodology for speaker recognition. Our methodology relies on hardware components, such as the microphone and the robot’s external cover, which may affect sound conditions. Therefore, the results presented in our scenario might change under different audio conditions.

From a technical perspective, the proposed methodology presents the following limitations. Regarding the voice biometrics extraction model, we noticed differences in embeddings between short and long sentences. Therefore, we required the user to read phrases of 10 or more words to obtain a valid dataset. Additionally, in the robot implementation, we forced Vosk, leveraging its text-dependent nature, to provide an embedding only when this condition was met. This constraint may affect the system’s usability in a real-world scenario, as users often speak in short sentences, resulting in the user not being recognised. Hence, future work should verify this limitation and adjust the number of words to determine the optimal minimum value.
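The minimum-length condition can be illustrated with a simple gate on the recognised text. This is a sketch under stated assumptions: it operates on an already transcribed string, the helper name `has_enough_words` is hypothetical, and it does not reproduce how the constraint is enforced inside the Vosk-based pipeline.

```python
def has_enough_words(transcript: str, min_words: int = 10) -> bool:
    """Return True when the transcript has at least min_words whitespace-separated words."""
    return len(transcript.split()) >= min_words


short_utterance = "Hello robot"
long_utterance = "The penguin walks slowly along the beach while the dog barks in the garden."
print(has_enough_words(short_utterance), has_enough_words(long_utterance))
```

Exposing `min_words` as a parameter makes it straightforward to lower the threshold in future experiments and observe how recognition quality changes with shorter utterances.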

Concerning the clustering algorithm, the hyperparameters presented in Table 4 are specific to our scenarios, application, and dataset. Consequently, this configuration might not be the best for other cases. However, we believe that our methodology can be easily extrapolated to other robots with similar hardware components and applications.
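To make the role of these hyperparameters more concrete, the sketch below shows a simplified neighbourhood query with the cosine metric, using the values reported for IncDBSCAN + Vosk in Table 4 (ε = 0.5, Mp = 2) as defaults. It is not the IncDBSCAN insertion procedure, which also creates and merges clusters; the function names, the majority-vote rule, and the two-dimensional toy embeddings are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def assign(
    embedding: Sequence[float],
    stored: List[Tuple[Sequence[float], int]],
    eps: float = 0.5,
    min_pts: int = 2,
) -> Optional[int]:
    """Return the most common label among neighbours within eps, or None for noise."""
    neighbours = [label for vec, label in stored if cosine_distance(embedding, vec) <= eps]
    # The query point counts towards its own neighbourhood, as in DBSCAN-style methods.
    if len(neighbours) + 1 < min_pts:
        return None
    return max(set(neighbours), key=neighbours.count)


# Toy two-dimensional embeddings with cluster labels 0 and 1.
stored = [((1.0, 0.0), 0), ((0.9, 0.1), 0), ((0.0, 1.0), 1)]
print(assign((1.0, 0.2), stored), assign((-1.0, -1.0), stored))
```

A smaller `eps` makes the query stricter and tends to increase noise detections, while a larger one risks assigning a speaker to another user's cluster, which is why the selected values may not transfer directly to other microphones or datasets.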

Finally, from an evaluation perspective, the datasets obtained for both the offline and online evaluations, as a first approach to the system’s performance, have a relatively small number of volunteers. Thus, subsequent studies with a larger number of participants or with publicly available datasets would be valuable future work. In addition, the proposed methodology has so far been tested with one speaker at a time. Therefore, we have not assessed its performance with multiple simultaneous speakers.

7. Conclusions

User recognition is an interesting functionality in social robotics that enables the robot to identify the user and adapt its behaviour to the user’s preferences and needs. This recognition is often performed using a camera and computer vision to detect the user’s face. However, some robots, such as pet-like robots, lack the hardware and computational resources to apply this. An alternative approach is to extract voice biometrics to distinguish between speakers.

This paper presents a user recognition methodology based on voice biometrics and dynamic clustering for known and unknown speakers for adaptive social robots. The methodology requires no internet connection and can run on computers with limited computational resources, such as a Raspberry Pi. It responds fast enough to run in real-time scenarios and is unsupervised, dynamically adding new users when it detects an unknown speaker. It requires only a microphone, a voice biometric extraction model, and a clustering algorithm, which makes it feasible for most robotic platforms. The system’s evaluation compared available open-source models and algorithms to identify the best combination for our application, which was the Vosk model with the IncDBSCAN algorithm, with recognition rates above 70% for both known and unknown speakers in a proof-of-concept online evaluation.

Despite the system’s promising performance, it has a few limitations. First, it depends on hardware elements such as the microphone’s placement. Second, utterances with fewer than 10 words are not considered for user recognition. Third, the algorithms’ hyperparameters are specific to our scenario. Finally, system performance with multiple simultaneous speakers has not been validated.

The overall goal of this development is to obtain a framework that enables user recognition from voice biometrics for personalised and adaptive human–robot interaction. The next step in this research line is to develop pet-like social robots that can adapt their behaviour to the different users they encounter. We would also like our methodology to include speaker diarisation to distinguish between users speaking to the robot simultaneously. In addition, fusing voice information with perception data from other modalities, such as inertial measurements or the user’s habits when using the robot, would enable the robot to learn not only how users speak but also how they behave, generating activities based on the user’s needs. Although Vosk is a text-dependent model that helps reduce noise input to the system when the user is not speaking, speech enhancement models are a possible approach for noisy environments. We would also like to explore the system’s scalability for longer interactions and larger numbers of users to determine whether memory usage, a scarce resource on computationally constrained robots, remains low under these conditions. Finally, this study could be repeated across different environments and with different microphones to test the system’s robustness, along with transfer learning to reuse information obtained under specific conditions in different ones.

Author Contributions

Conceptualization, A.S.-B., M.M.-G., J.J.G.-M. and J.C.C.; methodology, A.S.-B., M.M.-G., J.J.G.-M. and J.C.C.; software, A.S.-B. and M.M.-G.; validation, A.S.-B., M.M.-G. and J.J.G.-M.; formal analysis, A.S.-B. and M.M.-G.; investigation, A.S.-B., M.M.-G. and J.J.G.-M.; resources, J.C.C.; data curation, A.S.-B. and M.M.-G.; writing—original draft preparation, A.S.-B. and M.M.-G.; writing—review and editing, A.S.-B., M.M.-G., J.J.G.-M. and J.C.C.; visualization, A.S.-B. and M.M.-G.; supervision, M.M.-G. and J.C.C.; project administration, J.C.C.; funding acquisition, J.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received funding from grant PID2024-157304OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by the “European Union”; and from grant PID2022-140345OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by the “European Union”. It has also been supported by the project Percepción Multimodal Afectiva en Robots Sociales para Personas Mayores, funded by Universidad Carlos III de Madrid through the “Ayudas para la Actividad Investigadora de los Jóvenes Doctores” of the Programa Propio de Investigación.

Institutional Review Board Statement

The University Carlos III of Madrid accepted the realisation of this experiment after approval of the Data Protection Policies Committee.

Informed Consent Statement

All participants gave their explicit consent to participate in this experiment.

Data Availability Statement

Data is available upon request due to privacy constraints of the research projects involved.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Phrases Used for Dataset Recording

All volunteers read the following sentences during the dataset recording. The phrases were in Spanish, but we provide their English translation:

Today is a good day to go for a walk in the park and enjoy the sun, even though the weather is constantly changing.
The penguin walks slowly along the beach while the dog barks in the garden.
The blue house is near the river and has a huge garden full of colourful flowers that perfume the air in spring.
Today the wind is blowing strongly, and clouds cover the entire sky, while the branches of the trees move as if they want to dance with the storm.
The snowy mountains seem to glow in the light of dawn, and the valley’s silence is broken only by the distant song of birds.
It is always good to share a meal with family and friends, because around the table, sincere conversations, happy memories and new smiles are born.

References

Chan, J.; Nejat, G. Social Intelligence for a Robot Engaging People in Cognitive Training Activities. Int. J. Adv. Robot. Syst. 2012, 9, 51171.
Donnermann, M.; Schaper, P.; Lugrin, B. Integrating a Social Robot in Higher Education—A Field Study. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN); IEEE: New York, NY, USA, 2020; pp. 573–579.
Koh, W.Q.; Ang, F.X.H.; Casey, D. Impacts of Low-cost Robotic Pets for Older Adults and People With Dementia: Scoping Review. JMIR Rehabil. Assist. Technol. 2021, 8, e25340.
Bharatharaj, J.; Huang, L.; Al-Jumaily, A.M. Bio-inspired therapeutic pet robots: Review and future direction. In 2015 10th International Conference on Information, Communications and Signal Processing (ICICS); IEEE: New York, NY, USA, 2015.
Ruggiero, A.; Mahr, D.; Odekerken-Schröder, G.; Spena, T.R.; Mele, C. Companion robots for well-being: A review and relational framework. In Research Handbook on Services Management; Edward Elgar Publishing: Cheltenham, UK, 2022; pp. 309–330.
Berridge, C.; Zhou, Y.; Robillard, J.M.; Kaye, J. Companion robots to mitigate loneliness among older adults: Perceptions of benefit and possible deception. Front. Psychol. 2023, 14, 1106633.
Gasteiger, N.; Hellou, M.; Ahn, H.S. Factors for Personalization and Localization to Optimize Human–Robot Interaction: A Literature Review. Int. J. Soc. Robot. 2021, 15, 689–701.
Di Napoli, C.; Ercolano, G.; Rossi, S. Personalized home-care support for the elderly: A field experience with a social robot at home. User Model. User-Adapt. Interact. 2023, 33, 405–440.
Maroto-Gómez, M.; Alonso-Martín, F.; Malfaz, M.; Castro-González, Á.; Castillo, J.C.; Salichs, M.Á. A systematic literature review of decision-making and control systems for autonomous and social robots. Int. J. Soc. Robot. 2023, 15, 745–789.
Maroto-Gómez, M.; Castro-González, Á.; Castillo, J.C.; Malfaz, M.; Salichs, M.Á. An adaptive decision-making system supported on user preference predictions for human–robot interactive communication. User Model. User-Adapt. Interact. 2023, 33, 359–403.
Maroto-Gómez, M.; Lewis, M.; Castro-González, Á.; Malfaz, M.; Salichs, M.Á.; Cañamero, L. Adapting to my user, engaging with my robot: An adaptive affective architecture for a social assistive robot. ACM Trans. Intell. Syst. Technol. 2024, 15, 125.
Arango, J.A.R.; Marco-Detchart, C.; Inglada, V.J.J. Personalized Cognitive Support via Social Robots. Sensors 2025, 25, 888.
Jain, A.K.; Ross, A.; Prabhakar, S. An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 4–20.
Yan, H.; Ang, M.H.; Poo, A. A Survey on Perception Methods for Human–Robot Interaction in Social Robots. Int. J. Soc. Robot. 2013, 6, 85–119.
Yang, D.; Chae, Y.J.; Kim, D.; Lim, Y.; Kim, D.H.; Kim, C.; Park, S.K.; Nam, C. Effects of social behaviors of robots in privacy-sensitive situations. Int. J. Soc. Robot. 2022, 14, 589–602.
Prabakaran, D.; Shyamala, R. A Review On Performance Of Voice Feature Extraction Techniques. In 2019 3rd International Conference on Computing and Convergence Technology; IEEE: New York, NY, USA, 2019.
Kozhirbayev, Z.; Erol, B.A.; Sharipbay, A.; Jamshidi, M. Speaker recognition for robotic control via an iot device. In 2018 World Automation Congress (WAC); IEEE: New York, NY, USA, 2018; pp. 1–5.
Tuasikal, D.A.A.; Fakhrurroja, H.; Machbub, C. Voice Activation Using Speaker Recognition for Controlling Humanoid Robot. In 2018 IEEE 8th International Conference on System Engineering and Technology (ICSET); IEEE: New York, NY, USA, 2018.
Alonso-Martín, F.; Salichs, M.A. Integration of a voice generation systems in a social robot. Cybern. Syst. 2011, 42, 215–245.
Foggia, P.; Greco, A.; Roberto, A.; Saggese, A.; Vento, M. Few-shot re-identification of the speaker by social robots. Auton. Robot. 2022, 47, 181–192.
Amirgaliyev, B.; Mussabek, M.; Rakhimzhanova, T.; Zhumadillayeva, A. A review of machine learning and deep learning methods for person detection, tracking and identification, and face recognition with applications. Sensors 2025, 25, 1410.
Wiskott, L.; Fellous, J.; Krüger, N.; von der Malsburg, C. Face recognition by elastic bunch graph matching. In Proceedings of International Conference on Image Processing; IEEE: New York, NY, USA, 1997.
Wang, Y.; Shen, J.; Petridis, S.; Pantić, M. A real-time and unsupervised face re-identification system for human–robot interaction. Pattern Recognit. Lett. 2019, 128, 559–568.
Khalifa, A.; Abdelrahman, A.A.; Strazdas, D.; Hintz, J.; Hempel, T.; Al-Hamadi, A. Face Recognition and Tracking Framework for Human–Robot Interaction. Appl. Sci. 2022, 12, 5568.
Mekruksavanich, S.; Jitpattanakul, A. Biometric User Identification Based on Human Activity Recognition Using Wearable Sensors: An Experiment Using Deep Learning Models. Electronics 2021, 10, 308.
Lu, Z.; Wang, R.; Zhou, H.; Dong, N.; Lv, H.; Yang, G. A Novel Gait Identity Recognition Method for Personalized Human–robot Collaboration in Industry 5.0. Chin. J. Mech. Eng. 2025, 38, 191.
Álvarez-Aparicio, C.; Guerrero-Higueras, A.M.; González-Santamarta, M.Á.; Campazas-Vega, A.; Matellán, V.; Fernández-Llamas, C. Biometric recognition through gait analysis. Sci. Rep. 2022, 12, 14530.
Al-Qaderi, M.; Rad, A. A Multi-Modal Person Recognition System for Social Robots. Appl. Sci. 2018, 8, 387.
Freire-Obregón, D.; Rosales-Santana, K.; Marín-Reyes, P.A.; Peñate-Sánchez, A.; Lorenzo-Navarro, J.; Castrillón-Santana, M. Improving user verification in human–robot interaction from audio or image inputs through sample quality assessment. Pattern Recognit. Lett. 2021, 149, 179–184.
Folorunso, C.; Asaolu, O.; Popoola, O. A review of voice-base person identification: State-of-the-art. Covenant J. Eng. Technol. 2019, 3, 36–57.
Bai, Z.; Zhang, X.L.; Chen, J. Speaker recognition based on deep learning: An overview. Neural Netw. 2021, 140, 65–99.
Campbell, J. Speaker recognition: A tutorial. Proc. IEEE 1997, 85, 1437–1462.
Brydinskyi, V.; Khoma, Y.; Sabodashko, D.; Podpora, M.; Khoma, V.; Konovalov, A.; Kostiak, M. Comparison of Modern Deep Learning Models for Speaker Verification. Appl. Sci. 2024, 14, 1329.
Faundez-Zanuy, M.; Monte-Moreno, E. State-of-the-art in speaker recognition. IEEE Aerosp. Electron. Syst. Mag. 2005, 20, 7–12.
Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
Shome, N.; Sarkar, A.; Ghosh, A.K.; Laskar, R.H.; Kashyap, R. Speaker Recognition through Deep Learning Techniques: A Comprehensive Review and Research Challenges. Period. Polytech. Electr. Eng. Comput. Sci. 2023, 67, 300–336.
Steck, H.; Ekanadham, C.; Kallus, N. Is cosine-similarity of embeddings really about similarity? In WWW ’24: Companion Proceedings of the ACM Web Conference 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 887–890.
Farrell, K.; Mammone, R.J.; Assaleh, K. Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Process. 1994, 2, 194–205.
Bansé, D.; Doddington, G.R.; Garcia-Romero, D.; Godfrey, J.J.; Greenberg, C.S.; Martin, A.F.; McCree, A.; Przybocki, M.; Reynolds, D.A. Summary and initial results of the 2013–2014 speaker recognition i-vector machine learning challenge. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014; pp. 368–372.
Kiani, K.; Baniasadi, A. Speaker Recognition System based on Identity Vector using T-SNE Visualization and Mean-shift Algorithm. In 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS); IEEE: New York, NY, USA, 2019.
Sadjadi, S.O.; Kheyrkhah, T.; Tong, A.; Greenberg, C.S.; Reynolds, D.A.; Singer, E.; Mason, L.P.; Hernandez-Cordero, J. The 2016 NIST Speaker Recognition Evaluation. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017.
Sadjadi, S.O.; Greenberg, C.S.; Singer, E.; Reynolds, D.A.; Mason, L.P.; Hernandez-Cordero, J. The 2019 NIST Audio-Visual Speaker Recognition Evaluation. In Proceedings of the Odyssey 2020, Tokyo, Japan, 1–5 November 2020; pp. 259–265.
Sadjadi, S.O.; Greenberg, C.S.; Singer, E.; Mason, L.; Reynolds, D.A. The 2021 NIST Speaker Recognition Evaluation. In Proceedings of the Odyssey 2022: The Speaker and Language Recognition Workshop, Beijing, China, 28 June–1 July 2022; pp. 322–329.
Cephei, A. Vosk Offline Speech Recognition API. 2025. Available online: https://alphacephei.com/vosk/ (accessed on 13 February 2026).
Asha, C.; D’Souza, J.M. Voice-Controlled Object Pick and Place for Collaborative Robots Employing the ROS2 Framework. In 2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE); IEEE: New York, NY, USA, 2024; pp. 1–7.
Lemaignan, S.; Cooper, S.; Ros, R.; Ferrini, L.; Andriella, A.; Irisarri, A. Open-source natural language processing on the pal robotics ari social robot. In HRI ’23: Companion of the 2023 ACM/IEEE International Conference on Human–Robot Interaction; Association for Computing Machinery: New York, NY, USA, 2023; pp. 907–908.
Sikorski, P.; Yu, K.; Billadeau, L.; Esposito, F.; AliAkbarpour, H.; Babaias, M. Improving Robotic Arms Through Natural Language Processing, Computer Vision, and Edge Computing. In 2025 3rd International Conference on Mechatronics, Control and Robotics (ICMCR); IEEE: New York, NY, USA, 2025; pp. 35–41.
Soni, A.A. Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit. arXiv 2025, arXiv:2503.21025.
Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing; IEEE: New York, NY, USA, 2018.
Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624.
Snyder, D.; Garcia-Romero, D.; McCree, A.; Sell, G.; Povey, D.; Khudanpur, S. Spoken Language Recognition using X-vectors. In Proceedings of the Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018.
Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the INTERSPEECH 2020: Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020.
Villalba, J.; Chen, N.; Snyder, D.; Garcia-Romero, D.; McCree, A.V.; Sell, G.; Borgstrom, J.; García-Perera, L.P.; Richardson, F.; Dehak, R.; et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations. Comput. Speech Lang. 2020, 60, 101026.
Zeinali, H.; Wang, S.; Silnova, A.; Matějka, P.; Plchot, O. BUT System Description to VoxCeleb Speaker Recognition Challenge. arXiv 2019, arXiv:1910.12592.
Resemble AI. Public. Resemblyzer: A Python Package to Analyze and Compare Voices with Deep Learning. 2019. Available online: https://github.com/resemble-ai/Resemblyzer (accessed on 13 February 2026).
Resemble AI. Resemble AI: Generative Voice AI for Enterprise. 2024. Available online: https://www.resemble.ai/ (accessed on 13 February 2026).
Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2018; pp. 4879–4883.
Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996, 25, 103–114.
Ester, M.; Kriegel, H.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial Databases with Noise. In KDD ’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; AAAI Press: Washington, DC, USA, 1996.
Ester, M.; Kriegel, H.P.; Sander, J.; Wimmer, M.; Xu, X. Incremental Clustering for Mining in a Data Ware Housing; University of Munich Oettingenstr: Munich, Germany, 1998; Volume 67.
Cao, F.; Ester, M.; Qian, W.; Zhou, A. Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the 2006 SIAM International Conference on Data Mining (SDM); SIAM: Philadelphia, PA, USA, 2006.
Hahsler, M.; Bolaños, M. Clustering Data Streams Based on Shared Density between Micro-Clusters. IEEE Trans. Knowl. Data Eng. 2016, 28, 1449–1461.
Milligan, S.; Sales, G.; Khirnykh, K. Sound levels in rooms housing laboratory animals: An uncontrolled daily variable. Physiol. Behav. 1993, 53, 1067–1076.
Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
Berrar, D. Cross-validation. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019.
Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
Steinley, D. Properties of the Hubert-Arabie adjusted Rand index. Psychol. Methods 2004, 9, 386–396.
Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 2010, 11, 2837–2854.
Rosenberg, A.; Hirschberg, J. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007.
Rousseeuw, P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Cabezaolías, C.; de la Cruz Díaz, A.; Maroto-Gómez, M.; Castillo, J.C.; Salichs, M.Á. A Pet Robot Prototype for Animal-Assisted Therapy. In Advances in Practical Applications of Agents, Multi-Agent Systems, and Digital Twins: The PAAMS Collection; Springer: Cham, Switzerland, 2024; pp. 330–336.
Bo, V.; Garrell, A.; Sanfeliu, A. Fast or Accurate? How Intention-Recognition Models Shape Human Perception of a Mobile Robot. In HRI ’26: Companion Proceedings of the 21st ACM/IEEE International Conference on Human–Robot Interaction; Association for Computing Machinery: New York, NY, USA, 2026; pp. 502–506.
Waveren, S.V.; Carter, E.; Leite, I. Take One For the Team: The Effects of Error Severity in Collaborative Tasks with Social Robots. In IVA ’19: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents; Association for Computing Machinery: New York, NY, USA, 2019.
Hancock, P.A.; Kessler, T.; Kaplan, A.D.; Brill, J.C.; Szalma, J.L. Evolving Trust in Robots: Specification Through Sequential and Comparative Meta-Analyses. Hum. Factors 2021, 63, 1196–1229.
Campagna, G.; Rehm, M. A Systematic Review of Trust Assessments in Human–Robot Interaction. ACM Trans. Hum.-Robot. Interact. 2024, 14, 30.

Figure 1. Proposed methodology flowchart.

Figure 2. Methodology example for loading the database for three users and assigning the cluster numbers to each user’s ID.

Figure 3. Proposed methodology flowchart example for known and unknown users. Purple boxes show the case of a known speaker, where the methodology needs only one embedding to recognise the user. Green boxes show the case of an unknown speaker, where the methodology needs three embeddings to recognise the user.

Figure 4. Flowchart describing the offline evaluation process, indicating the time metrics used to obtain the performance of the proposed approach.

Figure 5. The pet-like social robot used for the integration of this work.

Figure 6. Software architecture of the robot, highlighting the voice translator which includes speech recognition based on Vosk and speaker recognition based on voice biometrics extraction and clustering.

Figure 7. Time metrics description from online evaluation. Fragment of flowchart, shown in Figure 1.

Figure 8. Volunteers’ gender and age distribution (Vosk embeddings visualised with t-SNE).

Table 1. Advantages and limitations of state-of-the-art speaker recognition systems implemented on social robots.

Reference	Advantages	Limitations
Alonso-Martín and Salichs [19]	User verification and speech extraction	No standardised performance evaluation
Kozhirbayev et al. [17]	Speaker recognition using Neural Networks and Mel-Frequency Cepstral Coefficients	Requires external online server and does not recognise unknown users
Tuasikal et al. [18]	User verification using Dynamic Time Warping and Mel-Frequency Cepstral Coefficients	Requires external online server
Foggia et al. [20]	Known and unknown user recognition	Requires embedded GPU (NVIDIA Jetson)

Table 2. Hyperparameter search space for the evaluated clustering algorithms.

Algorithm	Hyperparameter	Values Range
BIRCH	Threshold (T)	1.0–25.5, with step 0.5
IncDBSCAN	Neighbourhood radius ($ε$)	0.1–1.0, with step 0.1
	Minimum points ($Mp$)	1–10, with step 1
	Metric	cosine
DenStream	Decaying factor ($λ$)	0.005, 0.01
	Weight threshold ($β$)	0.1–0.7, with step 0.1
	Minimum weight ($μ$)	7–13, with step 1
	Micro-cluster radius ($ε$)	0.3–0.8, with step 0.1
	Initial samples ($n_{samples}$)	8–13, with step 1
	Stream speed ($v$)	3–7, with step 1
DBSTREAM	Clustering threshold ($r$)	0.5–7.0, with steps around 1.0
	Fading factor ($λ$)	0.01, 0.001
	Clean-up interval ($t_{gap}$)	2–6, with step 1
	Intersection factor ($α$)	0.1–1.0, with step 0.1
	Minimum weight ($w_{min}$)	0.1, 0.1–5, with step 1

Table 3. Inference time, total processing time, and number of embeddings for each voice biometric extraction model.

Voice Biometric Extraction Model	Model Inference Time (s)	Model Total Time (s)	Number of Embeddings
SpeechBrain TDNN	0.094	45.321	480
Resemblyzer	0.145	69.790	480
Vosk	0.645	327.280	506
SpeechBrain ECAPA-TDNN	3.008	1479.945	480
SpeechBrain ResNet	28.448	13,654.823	480

Table 4. Hyperparameters for each voice extraction model and clustering algorithm combination.

Model	Selected Hyperparameters
BIRCH + Vosk	$T = 10.0$
BIRCH + SpeechBrain ECAPA-TDNN	$T = 11.5$
BIRCH + SpeechBrain ResNet	$T = 12$
BIRCH + SpeechBrain TDNN	$T = 17$
BIRCH + Resemblyzer	$T = 12.5$
IncDBSCAN + Vosk	$ε = 0.5$, $Mp = 2$, cosine
IncDBSCAN + SpeechBrain ECAPA-TDNN	$ε = 0.4$, $Mp = 2$, cosine
IncDBSCAN + SpeechBrain ResNet	$ε = 0.4$, $Mp = 3$, cosine
IncDBSCAN + SpeechBrain TDNN	$ε = 0.4$, $Mp = 4$, cosine
IncDBSCAN + Resemblyzer	$ε = 0.4$, $Mp = 3$, cosine
DenStream + Vosk	$λ = 0.01$, $β = 0.1$, $μ = 13$, $ε = 0.5$, $n_{samples} = 10$, $v = 3$
DenStream + SpeechBrain embeddings	$λ = 0.01$, $β = 0.1$, $μ = 11$, $ε = 0.3$, $n_{samples} = 8$, $v = 3$
DenStream + Resemblyzer	$λ = 0.01$, $β = 0.2$, $μ = 10$, $ε = 0.3$, $n_{samples} = 10$, $v = 3$
DBSTREAM + All embeddings	$r = 1.0$, $λ = 0.001$, $t_{gap} = 2$, $α = 0.1$, $w_{min} = 0.1$

Table 5. Global ranking of clustering configurations for hyperparameters optimisation. Best-performing values in each column are highlighted in bold.

Rank	Model	ARI	AMI	V-Measure	Silhouette	Clusters	Clustering Latency (s)
1	BIRCH + Vosk	0.793	0.936	0.957	0.294	40	0.064
2	IncDBSCAN + Resemblyzer	0.824	0.923	0.952	0.475	44	0.301
3	IncDBSCAN + Vosk	0.774	0.921	0.947	0.309	42	0.278
4	BIRCH + SpeechBrain ResNet	0.722	0.870	0.918	0.247	47	0.065
5	BIRCH + Resemblyzer	0.519	0.827	0.884	0.311	41	0.065
6	IncDBSCAN + SpeechBrain ResNet	0.515	0.807	0.876	0.276	43	0.301
7	IncDBSCAN + SpeechBrain TDNN	0.393	0.770	0.843	0.237	35	0.358
8	BIRCH + SpeechBrain TDNN	0.331	0.734	0.819	0.202	41	0.055
9	BIRCH + SpeechBrain ECAPA-TDNN	0.325	0.689	0.784	0.190	35	0.085
10	IncDBSCAN + SpeechBrain ECAPA-TDNN	0.320	0.717	0.800	0.148	34	0.293
11	DenStream + Vosk	0.317	0.715	0.763	0.176	14	1.067
12	DenStream + Resemblyzer	0.227	0.671	0.728	0.158	13	3.341
13	DenStream + SpeechBrain ResNet	0.113	0.513	0.568	0.116	8	1.555
14	DenStream + SpeechBrain ECAPA-TDNN	0.105	0.436	0.476	0.109	5	0.499
15	DenStream + SpeechBrain TDNN	0.069	0.386	0.420	0.145	4	0.825
16	DBSTREAM + Vosk	0.028	0.189	0.205	0.096	2	0.075
17	DBSTREAM + SpeechBrain ECAPA-TDNN	0.026	0.184	0.201	0.129	2	0.107
18	DBSTREAM + SpeechBrain TDNN	0.026	0.171	0.187	0.161	2	0.263
19	DBSTREAM + SpeechBrain ResNet	0.022	0.169	0.186	0.087	2	0.135
20	DBSTREAM + Resemblyzer	0.016	0.131	0.149	0.051	2	0.139
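
For readers unfamiliar with the ARI column of Table 5, the sketch below computes the Adjusted Rand Index from its standard pair-counting definition using only the Python standard library. The function name and the toy speaker labels are illustrative assumptions; the authors' evaluation code is not shown in the article and may rely on a library implementation instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the pair counts of the contingency table."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # true class sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = sum_rows * sum_cols / total_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate case: both partitions are trivial and identical.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Toy example: three speakers, but the clustering merges the last two.
speakers = ["ana", "ana", "ben", "cal"]
clusters = [0, 0, 1, 1]
print(round(adjusted_rand_index(speakers, clusters), 3))
```

The index is 1 for a clustering that matches the speaker identities exactly (regardless of how clusters are numbered) and close to 0 for a random assignment.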

Table 6. Average leave-one-out cross-validation results for the top six models, ranked by success rate for new-sample recognition.

Rank	Model	ARI	AMI	V-Measure	Silhouette	Clusters	Validation Inference Time (s)	Validation Total Time (s)	Success Rate (%)
1	IncDBSCAN + Vosk	0.774	0.920	0.946	0.332	36.02	0.281	148.305	91.30
2	BIRCH + Vosk	0.789	0.935	0.956	0.294	39.95	0.073	53.167	84.98
3	IncDBSCAN + Resemblyzer	0.823	0.924	0.953	0.475	45.01	0.295	148.298	81.04
4	IncDBSCAN + SpeechBrain ResNet	0.515	0.808	0.877	0.277	43.99	0.289	145.226	72.50
5	BIRCH + Resemblyzer	0.518	0.827	0.884	0.311	40.99	0.073	50.579	64.58
6	BIRCH + SpeechBrain ResNet	0.724	0.871	0.919	0.248	47.49	0.072	50.459	52.29

Table 7. User recognition results with known and unknown users.

User	Number of Interactions	Correct Detections	Incorrect Detections	Noise Detections	New Cluster Detections	New Clusters Created
Known 1	21	2 (9.52%)	1 (4.76%)	2 (9.52%)	16 (76.19%)	1
Known 2	26	24 (92.31%)	0 (0%)	2 (7.69%)	-	0
Known 3	20	11 (55%)	0 (0%)	9 (45%)	-	0
Known 4	26	25 (96.15%)	0 (0%)	1 (3.85%)	-	0
Known 5	23	23 (100%)	0 (0%)	0 (0%)	-	0
Total Known	116	85 (73.28%)	1 (0.86%)	14 (12.07%)	16 (13.79%)	1
Unknown 1	21	21 (100%)	0 (0%)	0 (0%)	-	1
Unknown 2	19	19 (100%)	0 (0%)	0 (0%)	-	1
Unknown 3	26	12 (46.15%)	0 (0%)	14 (53.85%)	-	1
Unknown 4	14	14 (100%)	0 (0%)	0 (0%)	-	1
Unknown 5	27	10 (37.04%)	0 (0%)	10 (37.04%)	7 (25.93%)	2
Total Unknown	107	76 (71.03%)	0 (0%)	24 (22.43%)	7 (6.54%)	6
Total	223	161 (72.20%)	1 (0.45%)	38 (17.04%)	23 (10.31%)	7
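
The percentages in Table 7 follow directly from the outcome counts. The short sketch below recomputes them for the three "Total" rows and checks that the outcome counts add up to the number of interactions. Only values from the table are used; the dictionary layout and the function name `rates` are illustrative.

```python
# Counts copied from the "Total" rows of Table 7:
# (interactions, correct, incorrect, noise, new-cluster detections).
TOTALS = {
    "Total Known": (116, 85, 1, 14, 16),
    "Total Unknown": (107, 76, 0, 24, 7),
    "Total": (223, 161, 1, 38, 23),
}


def rates(interactions, *counts):
    """Return each outcome count as a percentage of interactions (2 decimals)."""
    if sum(counts) != interactions:
        raise ValueError("outcome counts do not add up to the number of interactions")
    return tuple(round(100 * c / interactions, 2) for c in counts)


for name, (n, *counts) in TOTALS.items():
    print(name, rates(n, *counts))
```

The recomputed values match the table, for example 85 of 116 known-user interactions gives 73.28% correct detections and 161 of 223 overall gives 72.20%.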


