1. Introduction
Social humanoid robots are becoming more prevalent, with their abilities advancing quickly. They are outfitted with various systems that allow them to perform human-like actions naturally, along with sensors that enable them to gather information about their surroundings, including other people. Some of these robots are fully autonomous and are effectively replacing humans in many situations, acting as a museum tour guide [1], a personal companion [2], a sales assistant [3], and so on. For such robots, it is crucial to design human–robot interfaces that are not only intuitive for the human user but also capable of adapting to their personal needs, preferences, and habits. A well-designed interface should facilitate seamless communication and interaction, minimizing the learning curve for users of different ages, backgrounds, and technological familiarity. This adaptability ensures that the robot can tailor its behavior and responses to suit individual preferences, providing a more personalized experience. Additionally, the interface should allow for continuous learning, enabling the robot to evolve alongside the user, recognizing patterns in behavior, adjusting to changing needs, and enhancing user comfort and engagement over time. An effective interface fosters trust and promotes long-term user satisfaction by creating a more natural, meaningful human–robot relationship.
Two common channels through which humans communicate are speech and gaze [4]. In this research, our long-term goal is to design intuitive interfaces for human–robot communication, allowing for robot adaptation.
The main contributions of this research are a new approach to the personalization of robot behavior based on Model Predictive Control, experiments performed in a simulated environment, and rule-based data generation used to verify the proposed solution.
This article is structured as follows: Section 2 discusses related work. Section 3 contains a high-level description and implementation details of our proposed solution. In Section 4, we describe the evaluation experiments and their results. Concluding remarks and directions for further research are presented in Section 5.
2. Related Research
In [5], Noriaki Mitsunaga et al. proposed using policy gradient reinforcement learning (PGRL) to build a system that allows a robot to adapt its behavior in a human–robot interaction scenario. They propose using subconscious body signals of humans as the main data for estimating the pleasure or discomfort caused by interacting with a robot. They use gaze contact time and the distance between the human and the robot to calculate the reward function and adjust six parameters defining robot behavior, including the robot's gaze time, gesture speed, and waiting time. In our study, we also use gaze data and adapt gestures, motion speed, and waiting time, as well as using and adjusting other parameters, e.g., the emotional state of the user and the volume of utterances.
Collecting the data used to train a system that controls robot behavior and adapts to changing situations is a very demanding and time-consuming task, according to Antonio Andriella et al. [6]. This task is demanding not only from the robot and experimenter perspective but even more so from the user perspective, mainly due to the sheer number of data samples that must be collected from a single user to achieve successful adaptation without prior knowledge. As a solution to this problem, the authors proposed using a persona–behavior simulator to create a base model, with user input used only to fine-tune the model used for adaptation. In our work, we use a similar approach to create the base model in a simulated environment. The main differences are the type of model used in the system and the complexity of the simulation.
One of the areas of human–robot interaction that can make great use of task adaptation and personalization is human–robot collaboration. Olivier Mangin et al. [7] explored this in a collaborative furniture construction task. They proposed using Partially Observable Markov Decision Processes (POMDPs) to decide which high-level task to execute, with a Monte Carlo solver employed for the short-term execution of the robot's plan. The authors used a high-level hierarchical task model (HTM), similar to our work; high-level abstract tasks, e.g., "assemble leg", were then divided into a more concrete series of subtasks using a POMDP, e.g., "hold, pick up leg, get screw, …". A single task transformed in this manner is called a restricted model (RM). Restricted models are recombined into the HTM, where the interaction between them is minimal. The benefit of such an approach is the possibility of focusing on short-term planning and analyzing each RM separately. It is also less computationally demanding, as calculating a global policy at a high level is complex compared to calculating short-term policies for simple tasks. Adaptation of the robot's behavior is achieved based on user feedback collected during the collaboration. The results presented in the paper suggest that the proposed approach produces excellent results.
Eshed Ohn-Bar et al. [8] studied personalization in adaptive navigation systems for blind people. They emphasized the importance of tailoring such a system to the user's personal needs to minimize navigation errors and user confusion. The authors propose using a weighted expert model to handle the limited amount of data collected from user interactions and compare the obtained results with a standard fine-tuning of a single model. The models are trained on user position and heading data extracted from user movements and Bluetooth beacon localization and are used to predict navigation instructions appropriate for the user at a given moment. Better results are achieved much faster, especially for turn-based actions. As in our work, they use a simulated environment based on real-world data to test their approach.
The authors of [9] propose an interesting approach to human–robot collaboration in the task of helping disabled people dress. They propose a system that instructs both the robot and the user on what to do next to achieve the goal. The proposed solution adapts to the user's physical capabilities and body size and shape. The system works in an offline mode: first, data about user capabilities and needs must be collected; then, the system creates personalized plans using a simulated environment; finally, during the second interaction, the robot can take advantage of the previous interactions, apply one of the plans, and, thanks to this adaptation, perform the tasks better. Their simulated environment is quite detailed, as it handles cloth physics, in contrast to our simulation, which uses a simplified approach.
Alessandro Umbrico et al. [10] emphasize the need to adapt and personalize robot behaviors according to the context and needs of a person in socially assistive robotics (SAR). This can improve the effectiveness of user support and, therefore, user acceptance. The authors propose a system that can work in a continuous loop in the SAR task using a holistic ontology of what the robot knows and can do. Their main focus is on the representation of the world. Sensor data are converted by a rule-based system into a semantic form and combined with the robot's self-knowledge (e.g., information on which sensors are available) in the Event and Activity Detection step. User adaptation is based on the International Classification of Functioning, Disability, and Health (ICF), as it defines semantics for representing the physical and cognitive capabilities of a person. These features, together with activities and participation, create a user model that is updated during interaction with the robot. The user model is used in the robot's decision process to select appropriate actions. In our work, we also apply a rule-based system, which results in good interpretability of the approach.
3. Proposed Solution
In our previous work [11], we proposed a human–robot interaction framework that can work on multiple platforms and streamlines the creation of interaction scenarios. The main focus was on perception modules and the framework architecture. In this work, we extend the adaptive decision module to provide behavior personalization. A diagram presenting our framework architecture is shown in Figure 1. The framework is implemented in Python.
Perception components are responsible for processing raw data obtained from the sensors and providing a higher-level perception description. The face-tracking component processes raw video from the camera, taking into account the robot's joint positions, and moves the robot to follow the detected user face. The Gaze-Direction Detection module extracts the user's gaze direction from the video and combines it with the robot's joint positions; it returns angle values representing the gaze direction in the camera's image plane. The Question-Detection component converts the audio signal to spectrograms and, using a deep neural network, detects when a question is asked. The Emotion component identifies the user's face in the camera image and classifies the expressed emotion using a deep neural network. The Speech-Recognition module transcribes the audio input into text; we use Google Speech-to-Text for this functionality to achieve the best results. The Behavior Planner reads Activity Scripts and, using the perception description and the Adaptive Decision Module, selects and executes Actions, which represent basic or complex activities performed by the robot according to the scenario outlined in one of the Activity Scripts.
Our solution is based on the Model Predictive Control (MPC) method [12]. MPC is an advanced control strategy that uses a mathematical model of a system to predict and optimize its future behavior over a defined time horizon. At each step, MPC solves an optimization problem by considering the current system conditions and constraints to determine the best sequence of control actions. Then, it applies the first action from the sequence, recalculating the next optimal sequence in subsequent steps as new data become available. This process allows MPC to handle multivariable systems with constraints on inputs and outputs, making it highly effective for complex, dynamic environments.
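As a minimal, generic illustration of this receding-horizon scheme (a sketch only, not our implementation; the predict and cost callables and the exhaustive enumeration of action sequences are placeholders):

from itertools import product

def mpc_step(state, actions, predict, cost, horizon=3):
    # Enumerate candidate action sequences over the horizon and keep the cheapest one.
    best_first_action, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:                      # roll the system model forward
            s = predict(s, a)
            total += cost(s, a)
        if total < best_cost:
            best_first_action, best_cost = seq[0], total
    return best_first_action               # apply only the first action, then re-plan at the next step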
In our solution, the MPC approach is applied in the decision module. Figure 2 presents its architecture, and Figure 3 shows a flow chart of how this module works internally. The User model is the key element used to adapt to the personality of the user. The User model is fed with input Data from the available sensors. It then predicts the user's reaction to all possible robot actions, and the best action is picked for execution by the optimization component. Selection is based on the cost function, which currently combines two factors: maximizing the user's positive emotions and increasing the user's focus, calculated over a given optimization time horizon.
MPC has several limitations. The first is computational complexity: the User model must be executed many times to choose the best decision, especially when the time horizon is long. However, in our approach, MPC is not applied to a real-time control problem; the Adaptive Decision Module controls high-level functionalities, so delays are more acceptable. Another problem is the limited computational resources of the robot platform. Our framework allows for distributed computation; therefore, the solution is to run highly demanding computations on a server. The final issue is cost-function parameter tuning. We tuned the parameters manually, as described below, but algorithmic approaches can also be applied (e.g., using Gaussian processes or a genetic algorithm [13,14]).
In our experiments, we apply a time horizon equal to one step for both user-reaction prediction and optimization. As a result, the action selection is greedy: only the immediate reaction to an action is predicted (without delayed effects), and the action providing the lowest cost in the next step is chosen. However, the algorithm is general and the time horizon can be extended, which would allow the robot to find more sophisticated action plans. The cost function can also be made more complex and include, e.g., the energy consumed by a given action, its duration, etc.
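With this one-step horizon, the optimization step reduces to a greedy choice. In the notation used here (introduced only for illustration), $f_{\mathrm{user}}$ denotes the User model, $x_t$ the current state, $\mathcal{A}$ the set of possible robot actions, and $J$ the cost function:

$$a_t^{*} = \underset{a \in \mathcal{A}}{\arg\min}\; J\big(f_{\mathrm{user}}(x_t, a)\big).$$

The robot executes $a_t^{*}$, observes the resulting state, and repeats the selection at the next step.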
To provide adaptability of robot behavior, this architecture is extended as follows. The state after the best-action execution is observed using sensors, and the collected data are used as User feedback. These results are evaluated using the reward function and are used to improve and reevaluate the User model. The complete decision process, including adaptation, is presented in Algorithm 1.
The Main_loop procedure is the primary function in the decision module that initiates and calls the other functions. Here, we present a simplified and abstract version of it. In reality, our framework operates in parallel, is partially distributed, and follows an event-driven paradigm, making it challenging to represent as a single function. The user_model parameter is an object representing the User model, while list_of_interactions is a list of planned user–robot interactions (in our case, the subsequent stages of the conversation between the robot and the user); for more details, please refer to [11]. The list_of_robot_actions contains all possible robot actions. The adaptive_decision function uses the User model to predict user reactions to all possible robot actions in a given state and selects the best action for execution. The get_best_action function chooses the optimal action from the robot action list based on the predicted user reactions. The apply_user_feedback function calculates and returns the reward based on the sensor data and the performed action, using Equation (1).
Depending on the model used, re-evaluation can be executed incrementally or in all-at-once batch mode. Incremental algorithms, like reinforcement learning, can learn continuously, using new data only, and do not require retraining to update the model. Algorithms without such capabilities, such as Random Forest, require training on the whole training data (base dataset combined with data from last experiments with the user) to improve the model each time new data appear.
Algorithm 1. Adaptive decision module algorithm. The Main_loop procedure iterates over the planned interactions; for each of them, adaptive_decision predicts the user reaction to every possible robot action with the User model, get_best_action selects the action to execute, and apply_user_feedback computes the reward from the observed sensor data, which is then used to update the User model.
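The decision loop can also be sketched in Python as follows; the observe and execute callables, the reward_function, and the user model's predict_reaction/update methods are assumed interfaces introduced only for this sketch, not our exact implementation:

def adaptive_decision(user_model, state, robot_actions, reward_function):
    # Predict the user's reaction (next state) to every possible robot action.
    predictions = {a: user_model.predict_reaction(state, a) for a in robot_actions}
    return get_best_action(state, predictions, reward_function)

def get_best_action(state, predictions, reward_function):
    # Pick the action whose predicted reaction maximizes the reward (Equation (1)).
    return max(predictions, key=lambda a: reward_function(state, predictions[a]))

def apply_user_feedback(state_before, state_after, reward_function):
    # Evaluate the reaction actually observed by the sensors after the action was executed.
    return reward_function(state_before, state_after)

def main_loop(user_model, interactions, robot_actions, reward_function, observe, execute):
    # `observe` returns the perception description of the current state; `execute` performs an action.
    for _ in interactions:
        state = observe()
        action = adaptive_decision(user_model, state, robot_actions, reward_function)
        execute(action)
        new_state = observe()
        user_model.update(state, action, apply_user_feedback(state, new_state, reward_function))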
In this approach, the personalization of robot behavior is achieved by developing an appropriate model. There are two ways to accomplish this: by enhancing the model with additional training data to improve its performance and by refining the reward function. By collecting more real data from the user and retraining the user model, the model becomes better adapted to the user's behavior. The reward function, which initially represents a theoretical baseline user, can also be adjusted to reflect the preferences of the specific user as more data are collected, improving personalization. Such an adjustment would require manually changing the values of the emotion and gaze factors or using some learning algorithm; see Equation (1).
The reward function $R(A)$ assigns a real value to the change in the user's emotional state and gaze pattern after the robot's action $A$. It corresponds to the MPC goal function. The adaptive decision module selects the action with the highest predicted value, so the reward function should return a value that represents the quality of the action in the current situation. Its definition depends on the application and the available sensors. In our research, it is calculated as a linear composition of recognized user emotions and parameters describing gaze contact between the user and the robot:

$$R(A) = Q_E(A) + Q_G(A), \qquad (1)$$

where

$$Q_E(A) = \sum_{x} w_x \left( E_x^{\mathrm{after}}(A) - E_x^{\mathrm{before}}(A) \right), \qquad Q_G(A) = \sum_{g} w_g \left( G_g^{\mathrm{after}}(A) - G_g^{\mathrm{before}}(A) \right).$$

These terms allow the analysis of the change in the observed behavior of a user after the execution of an action $A$. $Q_E(A)$ represents emotion quality: it is the sum of the differences between the emotions after ($E_x^{\mathrm{after}}$) and before ($E_x^{\mathrm{before}}$) execution of the action, each multiplied by a factor $w_x$.

The factor values were manually chosen to closely align with the rule-based expert system that represents the user in the simulated environment. Specifically, they were selected to give a high quality score for actions preferred by a user simulated by the expert system and a low score for actions that differ significantly from the user's preferences. The most distinct examples of differing actions are opposites, such as increasing volume versus decreasing volume.

Emotion $x$ is one of the following: angry (a), disgust (d), fear (f), happy (h), sad (s), surprise (su), and neutral (n). The corresponding values of $w_x$ are {−6.75, −5.5, −10.0, 5.5, −2.5, 1.5, 4.0}. They are designed to penalize negative changes and encourage positive or neutral ones.

Similarly, $Q_G(A)$ represents the quality of the human gaze. The following gaze directions $g$ were used in the calculations: contact with the robot (c), down (d), up (u), left (l), right (r), and diagonal (di). The values of $w_g$ are also manually chosen and equal {4.0, −1.0, −1.75, 0.5, −3.625, −0.25}, respectively, to promote engagement understood as gaze contact. The reward function is the sum of emotion and gaze quality.
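For illustration, Equation (1) can be written directly in Python using the weights listed above; the dictionary-based state representation is an assumption made for this sketch:

# Manually tuned weights from the text (per emotion / gaze direction).
EMOTION_WEIGHTS = {"angry": -6.75, "disgust": -5.5, "fear": -10.0,
                   "happy": 5.5, "sad": -2.5, "surprise": 1.5, "neutral": 4.0}
GAZE_WEIGHTS = {"contact": 4.0, "down": -1.0, "up": -1.75,
                "left": 0.5, "right": -3.625, "diagonal": -0.25}

def reward(before, after):
    """Weighted sum of changes in emotion and gaze shares after a robot action.

    `before` and `after` are dicts mapping emotion/gaze keys to observed shares (0..1).
    """
    emotion_quality = sum(w * (after[e] - before[e]) for e, w in EMOTION_WEIGHTS.items())
    gaze_quality = sum(w * (after[g] - before[g]) for g, w in GAZE_WEIGHTS.items())
    return emotion_quality + gaze_quality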
One might ask why we created a new, hand-crafted function instead of using an already existing function for data generation. The biggest advantages of this approach are the ease of interpreting how the function works, its low computational complexity, and its straightforward application in future real-time experiments.
4. Evaluation
Personalizing a robot's behavior in human–robot interaction scenarios is a complicated task for many reasons. The biggest one is that no dataset is readily available to train a model for such a task. Therefore, in order to realize the main goal of our research, such a dataset had to be created.
4.1. Dataset
To create a dataset, there are two possible approaches. The first is to perform experiments in which a robot and a human interact with each other, using a random user model to collect the data; a random model would allow us to explore a larger space of reactions than using no model or an arbitrarily created one. In our previous research [11], in real experiments with the robot, we used a mock job-interview scenario: a simple interaction with a dozen questions asked by the robot. In this scenario, the human's behavior while answering a question is analyzed, and the robot's behavior can then be modified according to a personalized model. Thus, every single experiment yields only a dozen data samples. The problem is that most machine-learning algorithms require hundreds or thousands of data samples for training and evaluation, so a huge number of real experiments would need to be conducted to gather enough data. Due to the lack of available time, space, and participants, such an approach was not feasible.
The other option is to collect data in a simulated environment, as was done in [6]. This is the approach taken in this study. It should be noted that simulated data come with their own set of drawbacks. First and foremost, the data are less reliable than those from a real-world experiment. Also, to make such an approach feasible, some simplifications had to be made. The biggest one is that, instead of simulating the whole robot, the human, and the environment, only the data collected from such an interaction were simulated. In addition, instead of personalizing for a given person, the behavior was adapted to a single personality trait, talkativeness, in order to simplify the learning process. A data sample describing the interaction between the user and the robot consists of 51 features; see Table 1.
Data are generated in a raw form, the same as the real data collected in experiments with a physical robot, and are later converted to a more usable form. The data sample consists of three parts: data collected from the user by the robot describing a given state, the action chosen by the robot in that state, and data collected after the action execution.
The first value in the data sample, time_from_last_question, is the time elapsed since the most recent question asked by the user; it is scaled from 0 to 1 and resets after a question is asked or the next round of conversation begins. number_of_question_asked is the number of questions asked in the current round of the experiment. time_from_last_event is the time since the last event that occurred in the current round, also scaled from 0 to 1. silence_detected is a binary value indicating whether the user was silent long enough to finalize the current round. user_speech_time is the duration during which the user was speaking, scaled from 0 to 1. emotion_X represents the percentage of emotion X observed in the user in the current round, calculated from the top 3 emotions consecutively observed in the round. The human gaze is represented by a set of features of the form gaze_looking_on_X, each representing the fact that the user is looking in direction X, expressed as a percentage of all gaze directions observed for the user. robot_gaze_strategy_X is a one-hot encoded robot gaze strategy (see [11]), robot_state_X is a one-hot encoded robot state, and triggered_events_X is a list of events in the current round.
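For illustration, a fragment of a single hypothetical data sample using the feature names above; the values are invented, and the full sample has 51 features:

sample = {
    "time_from_last_question": 0.32,     # scaled 0..1
    "number_of_question_asked": 4,
    "time_from_last_event": 0.08,        # scaled 0..1
    "silence_detected": 0,               # binary
    "user_speech_time": 0.55,            # scaled 0..1
    "emotion_happy": 0.6,                # share of the top-3 emotions in the round
    "emotion_neutral": 0.4,
    "gaze_looking_on_contact": 0.7,      # share of all observed gaze directions
    "gaze_looking_on_down": 0.3,
    "robot_gaze_strategy_human": 1,      # one-hot encoded gaze strategy
    "robot_state_speaking": 1,           # one-hot encoded robot state (hypothetical state name)
}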
For every data sample, there is also a label that is generated by the expert system representing the user, which describes the quality of the user’s reaction to the robot’s action: good, bad, or neutral. There are 14 possible robot actions: nothing, make louder, make quieter, gesture normal, gesture calm, gesture vivid, speak faster, speak slower, pitch higher, pitch lower, robot movements on, robot movements off, robot gaze on, and robot gaze off.
The main part of the simulator is a rule-based expert system that analyzes the input data from the user and the robot's reaction and then evaluates the quality of that reaction using rules stored in its knowledge base. The system uses 70 rules to predict the response of the user. The rule system is implemented in Python using the durable_rules library (see [15]). Rules vary in complexity, ranging from simple ones that work directly on raw data to more complex ones, like the following:

all('emotion' & 'happy' | 'surprise' | 'neutral',
    'emotion' & 'sad' | 'fear' | 'angry' | 'neutral',
    none('has') & 'positive_emotion' | 'negative_emotion' | 'neutral_emotion')
then emit fact 'mixed_emotion'
Rules have a priority of importance and are ordered hierarchically, from simple rules working on raw data to more complex and abstract ones working on previously derived facts. The emit fact function creates a new fact and adds it to the knowledge base on which the rules operate. A fact represents data, is stored as JSON, and remains in the knowledge base until retracted. The all function is true if all conditions separated by commas are true. The "|" sign stands for logical OR combining conditions, while the "&" sign is logical AND. The none function is true if the fact passed as an argument is not present in the knowledge base.
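As an illustration of how such a rule could be expressed with the durable_rules Python API (a sketch only: the ruleset name and the 'type'/'name' fact fields are assumptions, not our actual knowledge base):

from durable.lang import *

with ruleset('user_reaction'):
    # Fire when both a positive and a negative emotion fact are present in the knowledge base,
    # then derive a more abstract fact used by higher-level rules.
    @when_all(c.pos << (m.type == 'emotion') & ((m.name == 'happy') | (m.name == 'surprise')),
              c.neg << (m.type == 'emotion') & ((m.name == 'sad') | (m.name == 'angry')))
    def mixed_emotion(c):
        c.assert_fact({'type': 'derived', 'name': 'mixed_emotion'})

# Facts are stored as JSON and persist until retracted; asserting them triggers rule evaluation.
assert_fact('user_reaction', {'type': 'emotion', 'name': 'happy'})
assert_fact('user_reaction', {'type': 'emotion', 'name': 'sad'})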
The other part of the simulator is a generator that produces the actual data based on parameters derived from our previous experiments [11,16]. The gaze data collected during these experiments were converted into Markov Decision Models. The emotion model was statistically derived from emotion data, user personality, and the quality of the robot's actions. The simulator operates in a 12-round simulation cycle to mimic real interactions with the same individual, applying the gaze and emotion models to generate the next state based on the previous one. As a result, each set of 12 data samples, corresponding to consecutive questions, is interrelated. To generate the next data sample, the latter part of the previous sample is used, with this process applied to every sample except the first and the 12th.
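To illustrate the idea behind the gaze generator, the following sketch samples consecutive gaze directions from a first-order Markov model; the transition probabilities are invented placeholders, not the values estimated from our recordings:

import random

GAZE_DIRECTIONS = ["contact", "down", "up", "left", "right", "diagonal"]
# Hypothetical transition matrix (each row sums to 1); real values would come from recorded gaze data.
GAZE_TRANSITIONS = {
    "contact":  [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
    "down":     [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
    "up":       [0.40, 0.05, 0.40, 0.05, 0.05, 0.05],
    "left":     [0.40, 0.05, 0.05, 0.40, 0.05, 0.05],
    "right":    [0.40, 0.05, 0.05, 0.05, 0.40, 0.05],
    "diagonal": [0.40, 0.05, 0.05, 0.05, 0.05, 0.40],
}

def sample_gaze_sequence(start="contact", length=20):
    """Sample a sequence of gaze directions from the first-order Markov model."""
    state, sequence = start, [start]
    for _ in range(length - 1):
        state = random.choices(GAZE_DIRECTIONS, weights=GAZE_TRANSITIONS[state])[0]
        sequence.append(state)
    return sequence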
Figure 4 shows the simulation flow chart with X rounds, which corresponds to the for-each loop in the Simulate function in Algorithm 2. The Get_state blocks use the generator described in the previous paragraph. The get_action block follows the adaptive decision algorithm, with a random user model used during base-dataset creation to explore the environment and a trained model used in the evaluation and adaptation scenarios. User feedback is produced by the rule-based expert system described at the beginning of this section, which substitutes for a real user in the simulated environment. Based on the generated user feedback, the reward function value is calculated and the state is updated. Finally, the results are saved to the results file.
Algorithm 2. Dataset generation algorithm. The main procedure runs num_of_sims simulations and saves their results to a file; the simulate function loops over num_of_rounds rounds of the simulated human–robot interaction, generating one data sample per round.
The final dataset consists of 3000 simulated experiments, i.e., 36,000 data samples. The dataset generation algorithm is presented as Algorithm 2. The main procedure takes the following arguments: file is a handle to the file where the dataset will be saved, num_of_sims is the number of simulated interactions to perform, and personality is the personality type to simulate, e.g., Talkative. The main procedure is responsible for running num_of_sims simulations and collecting and saving their results to the file. The simulate function takes two arguments: num_of_rounds is the number of rounds to simulate in a human–robot interaction (one round can be, for example, a communication exchange in which the robot asks a question and the user answers), and personality_model is a representation of a user with the given personality. The randomize_robot_gaze_strategy is a simple function that returns a robot gaze strategy randomly selected from human, stare, and random.
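A sketch of this generation loop in Python, following the names used above; the get_state and get_action callables and the expert system's evaluate method are assumed interfaces, and the output format (one JSON line per sample) is a placeholder:

import json
import random

def randomize_robot_gaze_strategy():
    # Gaze strategy drawn from the three options named in the text.
    return random.choice(["human", "stare", "random"])

def simulate(num_of_rounds, personality_model, get_state, get_action):
    # One simulated talk: `personality_model` is the rule-based expert system standing in for the user,
    # `get_state` the data generator, and `get_action` the adaptive decision function.
    results, state = [], get_state(None)
    for _ in range(num_of_rounds):
        action = get_action(state, randomize_robot_gaze_strategy())
        feedback = personality_model.evaluate(state, action)   # good / neutral / bad label
        results.append({"state": state, "action": action, "feedback": feedback})
        state = get_state(state)                               # next state depends on the previous one
    return results

def main(file, num_of_sims, personality_model, get_state, get_action, num_of_rounds=12):
    for _ in range(num_of_sims):
        for sample in simulate(num_of_rounds, personality_model, get_state, get_action):
            file.write(json.dumps(sample) + "\n")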
4.2. Comparison of Machine-Learning Algorithms Used for Training User Models
In this work, we compared several algorithms that can be applied to train a user model for predicting user reactions, which allows the robot to choose the right action in a given situation within the MPC paradigm. We decided to test the following five algorithms (with parameters given in parentheses): Decision Tree (Gini split criterion), Random Forest (Gini split criterion, num_estimators = 500, max_features = 0.4), AdaBoost (num_estimators = 1000, lr = 0.4), Light Gradient Boosting Machine (LGBM) (subsample = 1.0, num_estimators = 500, lr = 0.4, num_leaves = 80), and KNN (n = 5), as well as two baseline algorithms, Random and One Rule, which select the reaction at random or using a one-rule model learned from the training data.
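Assuming scikit-learn and LightGBM implementations (the libraries used are not stated here), the listed configurations correspond roughly to the following constructors:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier

models = {
    "decision_tree": DecisionTreeClassifier(criterion="gini"),
    "random_forest": RandomForestClassifier(criterion="gini", n_estimators=500, max_features=0.4),
    "adaboost": AdaBoostClassifier(n_estimators=1000, learning_rate=0.4),
    "lgbm": LGBMClassifier(subsample=1.0, n_estimators=500, learning_rate=0.4, num_leaves=80),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
# Each model predicts the user's reaction (good / neutral / bad) from the 51 state-action features.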
To clean the training data by removing outliers, the following three algorithms were applied: Isolation Forest (IF) with the outlier fraction set according to [17], Local Outlier Factor (LOF) with the outlier fraction set according to [18], and one-class SVM (OCS) with an outlier fraction of 0.01.
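A sketch of this cleaning step with scikit-learn; the contamination values for IF and LOF are placeholders, since the text only refers to [17,18] for them:

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

def remove_outliers(X, y, detector):
    """Keep only the samples the detector labels as inliers (+1)."""
    labels = detector.fit_predict(X)
    return X[labels == 1], y[labels == 1]

detectors = {
    "IF": IsolationForest(contamination=0.05),       # placeholder fraction
    "LOF": LocalOutlierFactor(contamination=0.05),   # placeholder fraction
    "OCS": OneClassSVM(nu=0.01),                     # 0.01 outlier fraction from the text
}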
The results of the user-model testing, performed on a test dataset containing 25% of the cases of the whole dataset, are presented in Table 2. As we can see, the best results were achieved by the LGBM and Random Forest algorithms. All outlier-removal algorithms achieved similar results; see Table 3, Table 4 and Table 5.
To evaluate each machine-learning algorithm used to train the model, 300 simulated experiments were performed per algorithm, each consisting of 12 rounds. For the best action, we calculated the value of the quality function and then summed it over each experiment. The results shown in Figure 5 represent the performance of the base models, which are used in the framework to predict user behavior, as shown in Figure 3 and Algorithm 1. The charts show the evaluation results as box plots, where the median is marked in the center of each box and the box extends from the first quartile to the third quartile of the data. The best-performing model (LGBM) is the same as in the learning phase; see Table 2 and Figure 5.
As we can see, using outlier removal during the learning stage provides minimal or no improvement for the best algorithm, LGBM (compare Table 2, Table 3, Table 4 and Table 5). During evaluation, models trained on the cleaned data performed slightly better, but the difference was not statistically significant. The Random Forest and LGBM algorithms scored the best results in the learning and evaluation phases, as they are more complex and have a larger capacity than the other algorithms and can therefore better model the user in this task. The difference in performance between the Random Forest and LGBM algorithms and the other tested algorithms is statistically significant.
The time complexity of the best models (as listed in Table 2) tested in the evaluation scenario is presented in Figure 6. Each box plot displays the results of 3600 simulated interactions using the assigned model. The total execution time of the adaptive_decision function is measured. As expected, the simplest model (Random) has the shortest execution time, while AdaBoost has the longest. The LGBM model performs comparably to the fastest models while achieving the highest accuracy. In contrast, the Random Forest model is, on average, four times slower than the LGBM model, although it ranks second in terms of accuracy.
4.3. Adaptation
For the experiments checking whether robot behavior adaptation increases the reward function values over time, the two best-performing algorithms were chosen: Random Forest and LGBM. The user models were retrained after every simulated conversation with a user, using the new data samples together with bootstrapped samples from the base dataset. The training data were weighted to balance the difference in dataset sizes: new samples with a weight of 1 and base samples with a weight of 0.05. The simulation, consisting of 50 talks, was repeated 30 times. The results are presented in Figure 7. As we can see, the Random Forest algorithm increased the reward from 0.5 to almost 1.0 after 15 talks; the LGBM algorithm achieved a similar result after 25 talks. In the experiments, we also tested another approach to adaptation, in which an ensemble of models votes for the final result and a new model is added to the ensemble after every interaction. The experimental results were worse than for the base model.
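One retraining step under these settings can be sketched as follows, assuming a scikit-learn-style model that accepts sample weights (the bootstrap size is a placeholder):

import numpy as np
from sklearn.utils import resample

def retrain_user_model(model, X_base, y_base, X_new, y_new, n_bootstrap=1000):
    """Refit the user model on new samples (weight 1.0) plus bootstrapped base samples (weight 0.05)."""
    X_boot, y_boot = resample(X_base, y_base, n_samples=n_bootstrap)
    X = np.concatenate([X_new, X_boot])
    y = np.concatenate([y_new, y_boot])
    weights = np.concatenate([np.full(len(X_new), 1.0), np.full(len(X_boot), 0.05)])
    model.fit(X, y, sample_weight=weights)
    return model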
4.4. Discussion
On the test data (Table 2) and in the evaluation (Figure 5), the best performance was achieved by the LGBM algorithm, followed by Random Forest. However, in the robotics domain, the time complexity of the algorithms should also be taken into account. LGBM is also faster than Random Forest, which can be explained by comparing the sizes of the models: the LGBM model is more than 10 times smaller than the Random Forest model, mainly due to the number of trees (500) in the Random Forest model, which was selected as the optimal configuration for performance. The Decision Tree algorithm takes third place according to the F1 score. It is very fast and has a low execution-time variance and a small number of outliers; therefore, it could be used in time-demanding settings. This is especially important in MPC, because the model is applied many times. The reason is the simplicity of the Decision Tree model. The performance of the KNN model, which is also very simple, is unstable, with many long execution times despite a low median.
In addition, performance differences and hardware issues can arise between simulations and real-world experiments. To minimize these discrepancies, we use the same software framework (as shown in Figure 1) and the same servers as in the real-world experiments with the Pepper robot.
Finally, to resolve uncertainties about the validity of simulations, the most reliable approach is to conduct numerous experiments with real users.
So far, we have run very few experiments on the real robot using the designed framework and an LGBM-based user model trained offline. The proposed adaptation of robot behavior based on the MPC approach worked well.
5. Conclusions
In this study, we proposed an approach to the adaptation of robot behaviors to human actions based on Model Predictive Control. We also showed the possibility of using a simulated environment to create models for human–robot interaction and its advantages over performing real-world experiments. In the experiments, the performance of five machine-learning algorithms for the user model (Decision Tree, Random Forest, AdaBoost, LGBM, and KNN) was evaluated in the simulated environment, showing an increase in the average reward over time. The LGBM algorithm scored the best results, with Random Forest trailing not far behind. The IF and LOF outlier-removal algorithms show a small improvement in the simulated evaluation compared to no outlier removal. We also performed experiments to test the time complexity of the selected models. LGBM was more than four times faster than Random Forest; therefore, it is the best choice among the analyzed algorithms. The Decision Tree can also be considered a good choice in time-demanding cases. It has a slightly smaller F1-score than LGBM, but its execution time has a smaller variance and fewer outliers. Another advantage of the Decision Tree is its high explainability, which can also be useful.
All experiments were performed using prediction and control horizons equal to one, which is a limitation of the current implementation. Therefore, the search process is greedy. However, the approach proposed is general and allows for larger horizon values, which may be useful in complex tasks.
The results of experiments conducted in both a simulation environment and a real robot with a pretrained model confirmed that the proposed approach for the adaptation of robot behavior based on MPC performs effectively.
In our next study, we are going to collect additional data by conducting experiments with a real robot to statistically validate whether this approach can be effectively applied in real-world environments and to confirm our simulation-based results. The number of tests conducted so far has been insufficient for a comprehensive statistical analysis. We also plan to compare the proposed approach with reinforcement learning algorithms for this task. Preliminary studies have shown promising results. Another potential direction for future research is to conduct experiments with time horizon values greater than one to evaluate their impact on the robot’s performance.