Application of the Naive Bayes Classifier for Representation and Use of Heterogeneous and Incomplete Knowledge in Social Robotics

As societies move towards the integration of robots, it is important to study how robots can use their cognition to effectively choose their actions in a human environment, and possibly adapt to new contexts. When modelling these contextual data, it is common in social robotics to work with data extracted from human sciences such as sociology, anatomy, or anthropology. These heterogeneous data need to be used efficiently in order to make the robot adapt its actions quickly. In this paper we describe a methodology for the use of heterogeneous and incomplete knowledge, through an algorithm based on the naive Bayes classifier. The model was successfully applied to two different human-robot interaction experiments.


Introduction
In the near future, in order to be effectively integrated into human environments, robots should be able to interact using an adequate level of cognition. Such cognition should be flexible enough to allow the robots to understand human actions, store knowledge efficiently and react accordingly, taking into account changes in context.
Among artificial intelligence techniques and machine learning methods, a wide variety of techniques are effective under certain conditions. For example, neural networks can approximate non-linear functions, whereas reinforcement learning can effectively solve game playing and pathfinding problems. Unsupervised learning, by contrast, has the advantage of not needing labelled data for a training set. However, depending on the specific problem, most of these techniques may be inefficient. In general, the "no free lunch" theorem applies to machine learning [1], as the most fitting technique differs for each specific problem.
Robots should learn from experience and from interactions with other agents they encounter in their environment, much as animals, including humans, do. From the point of view of learning algorithms, this setting imposes certain constraints which often do not hold for mainstream machine learning applications. Such constraints include finite storage and memory, and the need to learn incrementally, by modifying behavior after each experience or every few experiences. Under such conditions, online learning is the natural approach for robots [2].
Adaptive robotics is a term used to indicate the study of robots that can exhibit behavior that adapts to the environment. The concept holds for several applications.
While a great deal of machine learning research has focused on batch learning, which is more amenable to mathematical formalization [3], more recently, due to the increasing availability of very large datasets and continuous streams of data, interest in online learning methods has flourished [4,5]. In particular, reinforcement learning has often been adopted as a paradigm [6], together with evolutionary algorithms [7]. An important recent development in the learning of autonomous behavior in simulated agents was achieved in [8]. According to [2], learning applications in robotics include inverse dynamics, human motion models, transition dynamics, state estimation models, and zero moment point control models, among others.
Approaches to robot learning include programming by demonstration, task learning and social, cognitive, and locomotion skill learning. Biologically inspired learning models are also used to investigate how teaching among humans or social animals can be applied to a robot [9]. In particular, the human ability to learn in a few trials is very interesting for making robot learning more efficient. This subject has been studied in [10] and applied to feed-forward motor commands.

Constraints in Learning during Interaction with Social Robots
In the case of a real world problem set in a human environment, the learning problem has some additional requirements. We summarize them here in three main points:
Synthesis: unlike games or other applications in which rules are clear, the complex mechanisms of a real world situation may be obfuscated; they therefore need to be synthesized in a simplified model.
Efficiency: the solution (not necessarily optimal) to the learning problem should be found in a small number of iterations and should not be computationally heavy. Refined learning techniques which require thousands or millions of iterations cannot be applied to a robot. While they can be used in simulation, they cannot be used when a physical machine is interacting with humans under time constraints.
Clarity: for safety reasons, when a robot interacts with humans, a clear, easy-to-understand learning process, which leaves out uncertainty, is preferable to a "black box" (such as a neural network), which might produce unpredictable behaviors.
Efficiency and clarity are requirements that robotics shares with other fields of application of artificial intelligence, such as videogames [11]. Conversely, synthesis is a process that is common in science whenever the complexity of the real problem has to be reduced upstream into a model.
Because of these constraints, it is often necessary to develop specific solutions. If successful, even ad hoc solutions can be generalized and become useful to other studies that adopt the same approach.
In social robotics, these constraints have to be considered. In particular, the concept of synthesis in making a model is applied to social data. This is because, for social robots, learning is used not only for transferring skills, tasks, and information [12], but also for showing aspects of human-style social intelligence, which requires their knowledge to be based on deep models of human cognition and social competence [13]. Making robots cognitive requires equipping them with social skills as well as the physical skills necessary to interact [14]. Synthesizing human cognition therefore becomes a complex multidisciplinary issue in social robotics, where quantitative and qualitative data extracted from the literature of the human sciences are used to build mental or social models.

Objectives of This Paper
The problem we are addressing involves learning through data extracted from heterogeneous sources, such as studies from the literature of the human sciences. In this regard, we believe that the quantitative approach has an advantage over purely qualitative descriptions, such as symbolic learning techniques, because the application of a probabilistic model instead of a deterministic one allows the efficient use of all the information available; we therefore refer to statistical learning rather than machine learning. A probabilistic approach was recently attempted in a study by Myagmarjav and Sridharan [15], where knowledge in human-robot interaction is acquired through selective active learning, using probabilities and queries. Lepora et al. also used the naive Bayes classifier in real world tasks on a moving robot with whiskers [16,17]. Our method is close to these works.
Our objectives are the following:
• formalize empirical social behaviors into a dataset;
• apply a learning technique to the dataset of correlations of features and actions;
• make the robot actually learn socially appropriate actions through online adaptation.
Compared to existing research, which has made limited use of the empirical results of social science, the main contribution of this paper is to show how, through an algorithm based on the naive Bayes classifier, heterogeneous and incomplete knowledge from social robotics can be used for statistical learning. Experiments are shown as examples of application, to serve as a model for future similar studies in the same field.
Two applications are described: the implementation of a greeting selection system for a humanoid robot (the experiment was introduced in [18]), and the implementation of a behavior selection system for attracting humans' attention (an experiment whose continuation was published in [19]). In both cases, the robot should be able to adapt its action selection to the evolving context, learning respectively an appropriate choice of gesture, and an appropriate choice of attention-attracting behavior. Merging empirical datasets that list correlations of features and actions with quantitative experimental data, and the subsequent learning of behavior, is a task that poses several challenges, which can be addressed by our method.
The rest of the paper is organized as follows: in Section 2 we describe the concept of the algorithm; in Section 3 we show two possible applications; Section 4 concludes the paper.

Generic Model
We want to implement a system that enables effective online learning for a robot by using data extracted from the literature of the human sciences as well as experimental data. Figure 1 describes the concept: a table is made out of training data; feature values are given as input, and the classifier outputs a chosen class that corresponds to an action to be implemented in the robot. The action is performed and evaluated through either questionnaires or experimental measurements. The resulting feedback updates the dataset.
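As a minimal sketch of this loop, the following toy example (all names are ours, not the paper's; the table holds one context with two candidate greeting actions, and the update rule anticipates the reward formula discussed later in the paper) shows how feedback reshapes the action selection:

```python
def classify(table, features):
    # argmax policy: pick the action (class) with the highest weight
    return max(table[features], key=table[features].get)

def update_table(table, features, action, reward, lr=1.0):
    # nudge the executed action's weight with the feedback signal:
    # positive rewards push the weight towards 1, negative towards 0
    w = table[features][action]
    d = (1.0 - w) if reward > 0 else w
    table[features][action] = w + lr * reward * d

# toy dataset: one context (feature vector) with two candidate actions
table = {("male", "adult"): {"handshake": 0.6, "bow": 0.4}}
ctx = ("male", "adult")
first = classify(table, ctx)            # "handshake" (weight 0.6)
update_table(table, ctx, first, -1.0)   # negative questionnaire feedback
second = classify(table, ctx)           # "handshake" dropped to 0, so "bow"
```

After one piece of negative feedback the preferred action flips, which is the kind of fast adaptation the constraints of Section 1 call for.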

Characteristics of Data
In human-robot interaction and in social robotics, it is common to work with quantitative as well as qualitative data extracted from the literature of studies of human sciences such as sociology, anatomy, anthropology, and so on. From these data we can extract features and classes; however, there are usually some properties that limit the possible choice of classifying methods:


• Heterogeneous data types: some features (such as gender) are binary; others (such as age) are continuous but can be discretized; others still (such as nationality) are categorical and not ordinal. Classes can also be represented as percentages or as absolute values, and features may be associated with more than one class to different degrees. Bayesian networks have been used to synthesize the findings from separate studies in sociology, biology and economics [20], and can be used for representing and predicting social behaviors [21]. However, they assume parent/child relations between variables, while in our problem we assume independence between class-conditional feature probabilities.

• Incompleteness: studies usually focus on one or two specific variables, whereas our model involves more. For example, a study with gender as a variable may fix some variables (e.g., the nationality of participants) while not specifying others (e.g., education level) which might be of interest. Missing data can make it difficult to apply classification techniques such as neural networks, or even just to represent the data in a space with principal component analysis. See [22] for a review of the problem from a statistical perspective.
• Set size: small datasets limit the choice of training methods. Data from different sources can be integrated in order to expand the training dataset, but this also causes the incompleteness problem stated above. In particular, when integrating human-studies data with experimental data, we receive the data incrementally (rather than in batch); online learning methods fit this kind of problem. Small experimental datasets have been used successfully in the modelling of complex processes in [23], but the composition of the training datasets becomes critical, with designed sets performing better than random ones. Other learning models were compared on problems with small datasets in [24], where mixed results were found depending on feature selection, and naive Bayes and its multinomial variation [25] were reported to outperform the others, including support vector machines, in many conditions.
Given these properties, we believe that the naive Bayes classifier is a good choice for dealing with this kind of data. Naive Bayes can easily handle weighted data, incomplete data, small datasets, and categorical variables.

Naive Bayes
First, let us define the set of features of our problem f = {f_1, ..., f_I} and the set of classes C = {C_1, ..., C_J}, given I features and J classes. The dataset is composed of K elements <f^(k), C^(k)>, resulting in {<f^(1), C^(1)>; ...; <f^(K), C^(K)>}.
Equation (1) represents the generic formula of the posterior probability for class C_j given the features f:

P(C_j | f) ∝ P(C_j) · P(f | C_j)    (1)

Naive Bayes is a simple generative probabilistic classification model which assumes independence between the features of the objects to be classified [26]. Therefore, the naive Bayes classifier applies Bayes' theorem with the assumption that the presence or absence of each feature is unrelated to the other features. Its effectiveness in classification has been proven despite this independence assumption [27]. Moreover, naive Bayes requires only a small amount of training data to estimate the parameters necessary for classification. These characteristics make it appropriate for many problems in social robotics.
Under naive Bayes we label the object with class C* according to Equation (2):

C* = argmax_{C_j} P(C_j) ∏_{i=1..I} P(f_i | C_j)    (2)

where P(C_j) is the prior probability of class C_j and P(f_i | C_j) is the likelihood of class C_j with respect to feature f_i. In general, these probabilities are estimated from the feature-class co-occurrence count table, e.g., using the maximum likelihood estimates of Equation (3):

P(f_i | C_j) = n(f_i, C_j) / n(C_j)    (3)

where n(f_i, C_j) is the number of co-occurrences of feature value f_i with class C_j, and n(C_j) the number of occurrences of C_j. The prior probabilities can be similarly derived from a count table. These count tables thus form the sufficient statistics of the naive Bayes model. Count tables can easily be updated online, by simply incrementing class-feature counters as each labelled example is processed, and the updated model can immediately be used for classifying new objects. In fact, naive Bayes' sufficient statistics form an additive monoid, and admit both efficient online training and efficient parallel learning [28].
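A minimal count-table implementation makes the online property concrete. This is our sketch, not the paper's code; it uses add-one smoothing and a fixed number of values per feature purely as placeholder assumptions (the paper's own m-estimate smoothing is introduced in a later section):

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Count tables are the sufficient statistics, so training on a new
    labelled example is a constant-time increment of counters."""
    def __init__(self, n_values=2):
        self.n_values = n_values               # assumed values per feature
        self.class_counts = defaultdict(float)
        self.pair_counts = defaultdict(float)  # (i, value, class) -> count

    def update(self, features, c, weight=1.0):
        # online training: just increment counters for this example
        self.class_counts[c] += weight
        for i, v in enumerate(features):
            self.pair_counts[(i, v, c)] += weight

    def classify(self, features):
        total = sum(self.class_counts.values())
        scores = {}
        for c, n in self.class_counts.items():
            s = math.log(n / total)            # prior P(C_j)
            for i, v in enumerate(features):   # likelihoods P(f_i | C_j)
                s += math.log((self.pair_counts[(i, v, c)] + 1)
                              / (n + self.n_values))
            scores[c] = s
        return max(scores, key=scores.get)
```

Because the model state is only counters, incorporating a new example and immediately reclassifying costs almost nothing, which is what makes naive Bayes fit the online setting described above.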

Conversion of Heterogeneous Data into a Dataset
When handling real data from different sources, we should distinguish the types of studies we are using, in order to clearly define a semantics of the weights w_j that can be associated with the classes C_j, and to build a count table which will serve as a batch for training. Data can appear in different forms, such as:
(a) A percentage in which all (and only) the classes of our problem are considered (the ideal case).
(b) A percentage in which one or more classes of our problem are not considered.
(c) A percentage in which one of the classes (labelled "other") may include the classes of our problem which are not specifically mentioned.
(d) Absolute values of measurement without a known scale (e.g., "15 times").
(e) Absolute values of measurement between a priori maximum and minimum values (e.g., "24 out of 30"). Likert scales and semantic differential scales fall into this category.
Case (a) is the simplest, as percentages are directly turned into weights between 0 and 1. For example, in a study in which 55% of the population belongs to class C_A, its weight w_A will be 0.55.
For case (b), let us now consider the following simple sample data:
• Study 1: when feature 2 is 0, 55% of the population belongs to class C_A and the rest to class C_B, whereas when feature 2 is 1, the results change to 39% and 61%.
• Study 2: under different conditions (feature 1 = 1) and with feature 3 fixed to 0, 50% of the population belongs to class C_B, and the rest to classes C_A and C_C.
In Table 1, we organized these data, marking fields as "unknown" where variables are not considered in the study. When some conditions change significantly between two studies, the index of the study itself should be treated as a feature.
Normalization is then necessary, because the data refer to different scales in which different classes were involved. In this case, class C_C is not present in the first study. Therefore, assuming a uniform prior estimate of 1/J for class C_C, where J is the number of classes, the weights for C_A and C_B can be rescaled so that the whole vector sums to 1. In the above case, where w_A = 0.39 and w_B = 0.61, they are scaled down to a total of 2/3 (as 1/3 is the prior estimate of w_C), becoming w_A = 0.26 and w_B = 0.41.
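The case (b) normalization can be sketched as follows (the function name is ours); the numbers reproduce the example above:

```python
def fill_case_b(known, n_classes):
    """Give each class missing from a study the uniform prior 1/J, and
    rescale the known weights so that the full vector sums to 1."""
    J = n_classes
    n_missing = J - len(known)
    mass_left = 1.0 - n_missing / J       # mass available for known classes
    scale = mass_left / sum(known.values())
    filled = {c: w * scale for c, w in known.items()}
    for c in range(n_missing):
        filled[("missing", c)] = 1.0 / J  # uniform prior estimate
    return filled

w = fill_case_b({"A": 0.39, "B": 0.61}, 3)
# w["A"] = 0.26, w["B"] = 0.41 (rounded), and the missing class gets 1/3
```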
The way of assuming a prior estimation for the missing class depends on the interpretation of the data, and cases (c), (d) and (e) differ from (b).
In Figure 2 we show excerpts from three different empirical studies [29,30,31] (dating up to 1983) on asymmetry in facial expressions, which provide different kinds of data, respectively focusing on (c) culture, emotion, and its background meaning; (d) asymmetry, gender and emotions; and (e) emotional valence, gender and asymmetry. A typical purpose of gathering these data could be making a robot capable of autonomously determining when an asymmetrical facial expression is appropriate.
In case (c) it may be desirable to split the remaining mass over all the classes not specifically mentioned. For example (as in Figure 2c, Happiness-Wisconsin), if the data report 58.3% class C_A, 16.7% class C_B, 11.1% class C_C, and the remaining 13.9% as "other", then, supposing that our problem has five classes, the latter part can be approximated as the sum of the weights of classes C_D and C_E. Filling in w_D = 0.0695 and w_E = 0.0695 and adding these data to the dataset will force these weights to be low, where a uniform estimation of 0.25 would be considered too high.
In case (d), in which, as in Figure 2, numbers of subjects are counted, it is enough to apply the same method as in (b), assuming a uniform prior estimate of 1/J for the missing classes. Figure 2e shows an example of values measured in pixels, which therefore have a maximum and a minimum value: the data vary within that scale. In this case it may be more reasonable to assign the average value of the scale to unknown classes. A further example with simpler data can clarify the reason for this approach in case (e). In a problem with four classes, two of them unknown, a data vector could be <18/30, 24/30, ?, ?>, which equals <0.6, 0.8, ?, ?>. Fixing a uniform estimate of 0.25 (as in case (d)) for w_C and w_D, and then normalizing only w_A and w_B, the weight vector would result in <0.21, 0.29, 0.25, 0.25>: having w_C and w_D higher than w_A may not be reasonable. Instead, assuming the data for C_C and C_D to be 15/30 (the average value), adding the resulting w_C = 0.5 and w_D = 0.5 to the dataset and then normalizing it will force them to be higher: the weight vector would be <0.25, 0.33, 0.21, 0.21>. How the relative weight of w_A changes compared to the unknown classes shows the importance of the semantics.
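Case (e) can be sketched in the same way (the function name is ours); fed with the <18/30, 24/30, ?, ?> vector above, it reproduces the <0.25, 0.33, 0.21, 0.21> result:

```python
def fill_case_e(measured, scale_max, n_classes):
    """Bounded-scale measurements: unknown classes get the scale midpoint,
    then the whole vector is normalized to sum to 1."""
    midpoint = scale_max / 2.0
    values = list(measured) + [midpoint] * (n_classes - len(measured))
    ratios = [v / scale_max for v in values]
    total = sum(ratios)
    return [r / total for r in ratios]

w = fill_case_e([18, 24], 30, 4)  # -> [0.25, 0.33..., 0.208..., 0.208...]
```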
In all the above cases, after calculating the unknown weights, their addition to the dataset is not strictly necessary for the naive Bayes classifier to work, but it does become necessary depending on the policy described in Section 2.6.1.

Customisation of Naive Bayes Formulas
In our naive Bayes classifier, the selection of the class C* is done through a formula different from the standard one.

Conditional Probability
First, the standard formula of conditional probability is rewritten as Equation (4), taking into account the fact that class membership is based on the weights w. The probability of f_i, the i-th feature of f, being equal to a value v is found through the sum, over each k-th sample <f^(k), C^(k)> in the dataset, of the weights that are defined when f_i = v, through a multiplier δ^(k)(v) (which can be 0 or 1). The formula for the class priors P(C_j) is rewritten in the same way in Equation (5).
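Since Equations (4) and (5) are only described in words here, the following is our reading of the weighted likelihood as a sketch (the function names, the data layout, and the normalization by the total class weight are our assumptions):

```python
def delta(sample_features, i, v):
    # the multiplier of Equation (4): 1 if feature i of this sample is v
    return 1.0 if sample_features[i] == v else 0.0

def likelihood(dataset, i, v, j):
    """P(f_i = v | C_j) over weighted samples; `dataset` is a list of
    (features, weights) pairs, weights[j] being the sample's weight for
    class C_j as built in Section 2.3."""
    num = sum(w[j] * delta(f, i, v) for f, w in dataset)
    den = sum(w[j] for f, w in dataset)
    return num / den if den else 0.0

data = [((0, 1), [0.6, 0.4]),
        ((1, 1), [0.2, 0.8])]
p = likelihood(data, 0, 0, 0)  # 0.6 / (0.6 + 0.2) = 0.75
```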
The best class C* is selected through Equation (6):

C* = argmax_{C_j} P(C_j) ∏_{i=1..I} P(f_i | C_j)    (6)

In our customized formula (Equation (7)), we leave the class priors P(C_j) out. This additional assumption is made because we do not want to give more weight a priori to more common classes, so only the independent probability distributions are considered.

C* = argmax_{C_j} ∏_{i=1..I} P(f_i | C_j)    (7)
Leaving out the class priors is equivalent to assuming a uniform probability distribution over the classes. Class distributions estimated from very small samples are inherently noisy, and we believe that using a flat prior here is a legitimate choice. An example can clarify the need for this modification. The typical case is a very small dataset, like the one in Table 2, in which one class appears more often than the others. Using the classification formula of Equation (6), the mapping of the maximum likelihoods for each possible combination of features would result as in Table 3, on the left. The bias towards class 2 comes from the priors P(C_j), which would be <0.2501, 0.3749, 0.2502, 0.2500>. Without such bias, the mapping appears as in Table 3, on the right. Extending the concept to a bigger mapping, it is easy to predict that priors can unbalance it and penalize less common classes, which may never be selected.

Smoothing Technique
Conditional probabilities P(f_i | C_j) are balanced out through an add-ε smoothing technique, namely the m-estimate [32], which avoids the problems that arise when the total number of occurrences of a feature under certain conditions equals 0. In generic terms, for estimating conditional probabilities from a table of data, instead of the standard formula of Equation (8),

P(A | B) = P(A ∩ B) / P(B)    (8)

the formula of Equation (9) is used:

P(A | B) = (n_c + m·p) / (n + m)    (9)

where n is the number of occurrences of B and n_c the number of co-occurrences of A and B. The probability of A given B no longer depends only on the joint probability of A and B divided by the probability of B, but is balanced out by p and m. We need p as a non-zero prior estimate of P(A | B), which we assume uniformly distributed, and a number m that expresses how confident we are of this prior estimate p.
The value m is known as the equivalent sample size of the m-estimate formula. It can be tuned on a development set; usually a small number is convenient for small datasets, in order to balance the probability estimated from the training data against the weight of p. The value p is the uniform prior estimate of the probability, usually set to 1/J, where J is the number of classes.
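The m-estimate is simple enough to state directly as a sketch (here n and n_c denote the occurrence counts of the conditioning event and of the joint event):

```python
def m_estimate(n_c, n, m, p):
    """P(A | B) ~ (n_c + m*p) / (n + m): with zero co-occurrences the
    estimate falls back towards the prior p instead of collapsing to 0."""
    return (n_c + m * p) / (n + m)

# zero co-occurrences: the unsmoothed estimate 0/10 would wipe out the
# whole naive Bayes product; the m-estimate stays small but non-zero
smoothed = m_estimate(0, 10, 2, 0.25)  # = 0.5 / 12
```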
As a result, the generic formula of the classifier is shown in Equation (10):

Incomplete Data
As we may have to deal with incomplete data, the multiplier δ^(k)(v), a function introduced for the calculation of the joint probability of a certain feature with a certain class, can be customized as well, as in Equation (11).
Depending on the quantity of incomplete training data, δ^(k)(v) can be defined arbitrarily. It can be set to 0.5, or even less if incomplete data are considered unreliable.
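A possible shape for the customized multiplier (values other than the 0.5 of the text are equally admissible, since it is an arbitrary reliability weight):

```python
def delta_incomplete(sample_features, i, v, missing_credit=0.5):
    """Multiplier of Equation (11): full credit on a match, none on a
    mismatch, partial credit when the study did not report the feature."""
    observed = sample_features[i]
    if observed is None:          # feature unspecified in the source study
        return missing_credit
    return 1.0 if observed == v else 0.0
```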


Adaptation through Rewards
In our model, the representation of knowledge has to be adaptive to feedback collected from experimental data. Therefore, the model includes rewards or penalties depending on the feedback.
Ideally, the algorithm has to adapt quickly. As in human-robot interaction we are dealing with real world problems rather than abstract ones, the number of iterations necessary for a complete adaptation from the initial mapping to another one should be comparable to the number of interactions humans need to understand behavior rules. The process should not require hundreds or thousands of steps.
Each time a class is selected and an action is executed by the robot, the probabilities of the current data sample <f*, C*> at step T are treated as weights and updated to the new weights at step T + 1 through the formula of Equation (12).
w(T + 1) = w(T) + l · r · d    (12)

The learning rate l, the reward factor r and the value d depend on the specific implementation. For instance, if the feedback data are obtained from 5-point Likert scales, r could be a value in {-1, -0.5, 0, 0.5, 1} depending on the questionnaire results, whereas d could be 1 - w(T) or w(T) depending on the questionnaire feedback being >3 (positive reward) or <3 (negative reward). Otherwise, if feedback is obtained from experimental data, such as the measurement of a participant's response of some kind, the formula for r could be tailored to be proportional to positive performance (such as the success of an action or the number of hits), or inversely proportional to negative performance (such as the delay of a response). Through the combination of these factors, a cost/reward function can be tailored to the experiment.
The learning rate l can be kept at 1 to assign equal importance to each step. The adaptation process can be forced to converge if l is instead a decreasing function (for example, following the e^(-x) curve), when real world scenario constraints make it necessary.
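The Likert-scale example above can be written out as follows (the linear mapping from answers to r is our reading of the text, which only lists the admissible values of r):

```python
import math

def likert_terms(answer, w):
    """5-point Likert answer -> (r, d): r takes values in
    {-1, -0.5, 0, 0.5, 1}; d = 1 - w(T) for positive feedback (>3),
    d = w(T) for negative feedback."""
    r = (answer - 3) / 2.0
    d = (1.0 - w) if answer > 3 else w
    return r, d

def update_weight(w, answer, step=0, decaying=False):
    r, d = likert_terms(answer, w)
    l = math.exp(-step) if decaying else 1.0  # decreasing l forces convergence
    return w + l * r * d                      # Equation (12)

w_up = update_weight(0.5, 5)    # strong positive feedback -> 1.0
w_down = update_weight(0.5, 1)  # strong negative feedback -> 0.0
```

Note how d bounds the update: positive rewards can never push a weight above 1, and negative rewards can never push it below 0.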
Rewards are also commonly used in reinforcement learning, where transitions between states happen through a range of actions: rewards affect good states or actions when the goal is reached. In our case, rewards modify the weights of the actions associated with states, but there is no transition from one state to another: one whole iteration consists of just one state, defined by the current feature values.

Class Selection
Whenever a new vector of input features f* has to be classified, the following cases may happen:

• The set of features f* is not present in the dataset. In this case, naive Bayes is calculated.

• The set of features f* is already present in the dataset and the vector is associated with some weights. In this case, classification can be done either by:
o using the current weights: generally speaking, for this option to be possible, the data have to be consistent across all classes for f*, with missing data previously filled in as explained in Section 2.3;
o ignoring the current weights and recalculating the probabilities through naive Bayes (leave-one-out cross-validation [33]).
Whether the current input feature vector f* corresponds to a vector of newly computed probabilities or to a vector of previous weights, in either case there are several possible policies for selecting the output class C*:

• The simplest solution is to take the maximum value, as in the standard Equations (6) and (7).

• Another possibility, which gives more emphasis to exploration, is to use an ε-greedy policy: with 0 < ε < 1, C* is the argmax with probability 1 - ε, and a random selection with probability ε.
• C* can be assigned any of the possible classes, with probability proportional to the list of weights. For example, with a weight vector <0.4, 0.2, 0.1, 0.3>, the first class would be selected with 0.4 probability.
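The three selection policies above can be sketched compactly; the function and parameter names below are illustrative, not from the paper.

```python
import random

def select_class(weights, policy="argmax", epsilon=0.1, rng=random):
    """Pick a class index from a weight vector under one of three policies."""
    argmax = max(range(len(weights)), key=weights.__getitem__)
    if policy == "argmax":                     # maximum-likelihood choice
        return argmax
    if policy == "epsilon_greedy":             # explore with probability epsilon
        if rng.random() < epsilon:
            return rng.randrange(len(weights))
        return argmax
    if policy == "proportional":               # sample proportionally to weights
        return rng.choices(range(len(weights)), weights=weights, k=1)[0]
    raise ValueError(f"unknown policy: {policy}")

# For <0.4, 0.2, 0.1, 0.3>, argmax always picks class 0, while the
# proportional policy picks it with probability 0.4.
chosen = select_class([0.4, 0.2, 0.1, 0.3])
```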

Stopping Conditions
The adaptation process can potentially continue indefinitely, or it can be stopped according to criteria that are up to the experimenter. When the stopping conditions are satisfied, rewards no longer apply, and the system is considered fully trained.
One possible method is to track the maximum likelihoods for each possible combination of features. Mapping these values, as in Table 3, produces a table in which each cell represents a "state" and is associated with an "action", represented by the class with maximum likelihood C*. Stopping conditions can then be triggered when the changes of C* have stabilized. This happens when the following two conditions (Equations (13) and (14)) hold at the same time. The first ensures that all states have been explored at least once: for every feature vector there is at least one entry k of the dataset, defined as <f(k), C(k)>, whose f(k) corresponds to that feature vector:

∀f, ∃k : f(k) = f (13)

The second checks whether the moving average of the changes of C* over the latest W iterations (the window length) has dropped below a threshold θ:

(1/W) · Σ (σ_C / σ_TOT) < θ (14)

In Equation (14), σ_C is the number of states in which C* changed, and σ_TOT is the total number of states. The threshold θ can be set as q/σ_TOT (a tolerance of q average changes, with q determined according to the desired convergence).
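The two conditions can be combined into one stopping test, sketched below under our own naming; the caller is assumed to report, at each iteration, the set of visited states and how many states changed their argmax class.

```python
from collections import deque

def make_stop_checker(n_states, window=10, q=2):
    """Stopping test in the spirit of Equations (13) and (14) (a sketch).

    Condition (13): every state has been visited at least once.
    Condition (14): the moving average, over the last `window` iterations, of
    the fraction of states whose argmax class changed falls below
    theta = q / n_states.
    """
    history = deque(maxlen=window)
    theta = q / n_states

    def stop(visited_states, n_changed):
        history.append(n_changed / n_states)   # sigma_C / sigma_TOT this step
        all_visited = len(visited_states) >= n_states
        window_full = len(history) == window
        return all_visited and window_full and sum(history) / window < theta

    return stop

stop = make_stop_checker(n_states=8, window=3, q=2)  # theta = 0.25
```

Until the window is full, the checker never fires, which prevents stopping on too little evidence.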

Results and Discussion
In this section we describe two applications of the algorithm. We will not go into the details of the research behind those experiments, but rather show the effectiveness of our method. In both applications, some details of the algorithm were customized. A discussion follows in the last part.

Purpose of the Study
As humans, we greet each other following sometimes complex or unclear rules, which vary by country or even by region. Nevertheless, when exposed to a new culture, we are able to adapt to a new set of rules after only a few interactions. The purpose of this study was to make a humanoid robot adapt to the German way of greeting while being initially trained with Japanese social data about greeting rules. Cultural differences are indeed important for ensuring technology acceptance [34], including for more complex machines such as robots. In a recent work, Heenan et al. built a state-machine model of greetings that includes the approach phase [35]; however, it does not take cultural factors into consideration, which is the focus of this experiment. ARMAR-III [36], a humanoid robot designed for close cooperation with humans, was used in this experiment, which took place in the room shown in Figure 3.

Greeting Selection System
The implementation of the greeting selection system is composed of two parts, regarding gesture and speech; in Figure 4 and in the next paragraph we describe only the part related to gestures. The system takes context data (in blue) as input features and produces the appropriate robot posture for that input. The robot posture configuration is implemented through a process described in [18].
The feature values and classes regarding greeting gestures are summarized in Table 4. As the experiment consists of adapting from knowledge extracted from one culture (Japanese) to another (German), culture acts as a discriminant rather than as an input feature. In this way, there will be a different mapping for each culture.
Mappings are updated using experimental data in the form of questionnaires filled in by experiment participants. For initialization, the only training set needed is the Japanese one: before the experiment began, a table made from Japanese sociology data was built, yielding a mapping called M0J. After interacting with German people, the resulting adapted mapping M1 is expected to represent German rules of greeting interaction. A mapping M0G, made from a table of German sociology data on gestures, was also built and used only for validation. Our hypothesis is that M1 will be closer to M0G than to M0J.

Referring to Figure 5, let us summarize the concept of the algorithm and the way rewards are calculated:

1. The dataset is built from the training data: a weight vector w_j(f) corresponds to each feature vector added.

2. Whenever a new feature vector f* is given as input, the system checks whether it is already contained in the dataset. If it is, the weights are read directly from the dataset and the greeting corresponding to the highest weight is selected; otherwise, classification is calculated through naive Bayes.

3. The naive Bayes classifier chooses as the best greeting g* the greeting g_j with the highest probability, calculated from its weights w_j using the add-ε smoothing technique and a multiplier δ, as in Equation (10).

4. Once the greeting is chosen, the resulting probabilities are normalized. The stopping condition is then evaluated as in Equations (13) and (14). If all conditions are satisfied, no update is performed, as the mapping has already stabilized.

5. Otherwise, the next step consists of getting the participant's evaluation of whether the currently selected greeting g* is appropriate, according to the participant's culture, for the current context f*. On a scale from 1 to 5, if the rating is greater than 3, the weight of that greeting for the present context is multiplied by a positive reward; if it is less than 3, it is multiplied by a negative reward; if it is exactly 3, nothing is done. Every vector f starts with a counter s set to 0, and every time a vector is processed, its counter increases, making the learning factor decrease and dampening the magnitude of the rewards.

6. If the evaluation is less than or equal to 3, the participant is also asked to indicate which greeting type would instead have been appropriate in this context f*. The weight of that greeting g** is boosted.

7. The participant is finally asked to indicate, for the chosen greeting type g*, in which context f** it would have been appropriate. If there is one, the weights corresponding to f** are updated with a boost for the current greeting; otherwise, if g* is judged inappropriate in any case, all of its weights receive a negative reward. The vector f** is added to the dataset if new, or updated if it already exists.

8. All the new weights in the dataset are normalized (the probabilities of the greeting types for a single context combination must sum to 1). At this point, the algorithm is ready for a new input and goes back to step 1. The next time the input feature vector is the same as the one just added, the weights will be used directly (step 2 instead of step 3).
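One interaction of this loop can be sketched as follows. Here `classify`, `ask_user`, and `normalize` are hypothetical callbacks standing in for the naive Bayes step, the questionnaire, and the normalization, and the 0.5 reward magnitude is an arbitrary illustrative choice rather than the paper's actual reward factors.

```python
def greeting_step(dataset, f_star, classify, ask_user, normalize):
    """One interaction of the greeting loop (steps 2-8), as a hedged sketch.

    `dataset` maps feature vectors to weight lists; `classify` runs naive
    Bayes for unseen vectors; `ask_user` returns (rating, better_greeting,
    better_context) from the questionnaire. All names are illustrative.
    """
    # Steps 2/3: reuse stored weights or fall back to naive Bayes.
    weights = dataset.get(f_star) or classify(f_star)
    g_star = max(range(len(weights)), key=weights.__getitem__)

    # Step 5: Likert evaluation of g_star in context f_star.
    rating, g_better, f_better = ask_user(g_star, f_star)
    if rating > 3:
        weights[g_star] += 0.5 * (1 - weights[g_star])   # positive reward
    elif rating < 3:
        weights[g_star] -= 0.5 * weights[g_star]         # negative reward

    # Step 6: boost the greeting the participant suggested instead.
    if rating <= 3 and g_better is not None:
        weights[g_better] += 0.5 * (1 - weights[g_better])

    # Step 8: renormalize and store; step 7 would update dataset[f_better]
    # in the same way for the suggested context.
    dataset[f_star] = normalize(weights)
    return g_star
```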
As explained in steps 5, 6 and 7, the questionnaire is made up of three questions; this means that the rewards may directly affect three weights in two cells of the mapping at the same time (and indirectly all the others), making the learning process much faster.

Experiment Results and Validation
In this experiment, 18 German people of different ages, genders, workplaces, and levels of familiarity with the robot were invited to participate. Some of them took part in a second interaction. By the end, all the feature values had the chance to be classified at least once, except for "Location = Private."

The number of interactions, including repetitions, was 30; it was determined by the stopping condition of the algorithm. Each time a state, determined by a combination of feature values, had its weights modified during the feedback after an interaction, it was counted as "visited." The count of how many times states were visited is shown in Table 5. The new mapping of gestures was verified through an objective function V, described in the generic Equation (15), which measures the difference between two mappings, M1 and M2.
V " The function calculates the sum of the variance between the weights w in the same features vector f in two different mappings, M1 and M2.Each variance in the weights is calculated not only by comparing the greeting with maximum likelihood, but also considering the sum of the variances for each greeting j.
Applying the function to M0J (the initial Japanese mapping) and M1 (the final mapping) gives 0.636 as the result; comparing M1 with M0G (the initial German mapping) gives 0.324. A t-test on the variances for each f proves the difference to be significant (p = 0.02), validating the hypothesis that M1 would become closer to M0G than to M0J. This result supports the evolution of the mapping M1 from M0J towards M0G, assuming that M0G represents German rules correctly: for each social context, the robot learned a gesture that is close to the German one and more appropriate than the Japanese one. How the weights evolve is shown in Tables 6 and 7, where the left-hand tables show the starting (Japanese) condition; on the right, the variances decrease in Table 6 and increase in Table 7.
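Since Equation (15) is not reproduced here, the following sketch computes a comparable mapping distance by summing, over every shared context f and greeting j, the squared difference of the weights in the two mappings. This is our own formulation of the idea, not necessarily the paper's exact formula, and the mappings are toy data.

```python
def mapping_distance(m1, m2):
    """Sum over shared contexts of the per-greeting squared weight differences."""
    total = 0.0
    for f in m1.keys() & m2.keys():
        total += sum((a - b) ** 2 for a, b in zip(m1[f], m2[f]))
    return total

# Toy mappings with one context and two greetings (illustrative numbers only).
m0j = {("friend", "male"): [0.7, 0.3]}   # initial Japanese mapping
m0g = {("friend", "male"): [0.2, 0.8]}   # initial German mapping
m1  = {("friend", "male"): [0.3, 0.7]}   # adapted mapping

# m1 ends up closer to m0g than to m0j, mirroring the validation step.
assert mapping_distance(m1, m0g) < mapping_distance(m1, m0j)
```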

Purpose of the Study
Socially assistive robots [37] are a category of robots that can actively assist humans (especially elderly people) thanks to their navigation abilities and their multimodal communication capabilities. However, in order to communicate with a human partner, the robot first needs to attract the human's attention. Different modalities of communication have been evaluated [38] and compared with multimodal communication cues [39]. The purpose of the present experiment is to explore strategies for successfully initiating communication with humans in a changing environment, making a robot learn which communication channel is most effective for each condition of the room where the interaction occurs. The social robot Nao [40] was used for this application, which was divided into three separate experiments (one of them published in [19]) with different conditions among them. We will examine in detail only the first one. In order to assess the effectiveness of learning, the final probability matrix distribution and the average rating distribution of every behavior in the questionnaires were compared. A correlation test found a positive correlation (Pearson correlation of 0.716), proving that the preferences of the participants were learned. Figures 7 and 8 show the variation of average ratings and the respective probability distributions for every action. This essentially means that the robot learned to adapt its attention-attracting behavior selection to the evolving context.
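The validation step above can be reproduced with a plain Pearson correlation between the per-behavior average ratings and the final learned probabilities. The numbers below are invented for illustration only; they do not reproduce the reported 0.716 result.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative numbers only: average questionnaire rating per behavior vs.
# the final learned probability for the same behaviors.
ratings = [4.2, 2.1, 3.5, 1.8]
final_probs = [0.45, 0.10, 0.35, 0.10]
r = pearson(ratings, final_probs)   # a strong positive correlation
```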

About this Approach
The term "machine learning" was not used in this paper to refer to our solution, which is more appropriately referred to as "statistical learning." We believe the advantages of our approach are its feasibleness and its potential to be used by other researchers, tuning it according to their needs, especially if the datasets are made of heterogeneous or incomplete knowledge, and if the goal is to adapt through experimental data.Other more refined machine learning solutions are more appropriate for other kinds of studies, in which the focus is not on interaction and the learning problem is more complex.
This solution also has the advantage of being independent of the implementation and of not being robot-specific. Its limitation is the consequence of this advantage: since the algorithm is implemented separately, it must be integrated with the robot's environment each time.

Customization
The method we propose can be customized and extended. In the applications described, the following variations were attempted:

• Class priors left out of the probability formula: alternative solutions are possible, such as a function that biases towards low-probability classes using the inverse of the priors, or towards the classes that have been selected fewer times.
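The inverse-prior variation can be sketched as follows; this is our own formulation of the bias, with illustrative names, not the paper's implementation.

```python
def bias_by_inverse_prior(posteriors, priors):
    """Rescale class scores by the inverse of their priors, then renormalize.

    This favours rarely-selected classes: a class with a small prior is
    boosted relative to its raw posterior score.
    """
    scores = [p / q if q > 0 else p for p, q in zip(posteriors, priors)]
    total = sum(scores)
    return [s / total for s in scores]

# Two classes with equal posteriors: the one selected less often (small
# prior) comes out ahead after the bias is applied.
out = bias_by_inverse_prior([0.5, 0.5], [0.8, 0.2])
```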

Conclusions
An adequate level of cognition is necessary to employ a robot in a human environment. There is a need to represent knowledge, which can be incomplete and heterogeneous, in an effective and efficient way. Adaptation to changing environments can be carried out through different learning techniques, which can be more or less effective depending on the specific problem. This work introduces a quick and effective way of handling this kind of adaptation problem based on statistical learning, specifically on naive Bayes. The method we propose was applied to two different experiments of human-robot interaction. In the first, the humanoid robot ARMAR-IIIb adapted its greeting behavior from the Japanese to the German style, using Japanese sociology data for training. In the second, the robot Nao learned different ways of attracting human users' attention depending on the context. In both cases, the proposed method can be customized and readapted for use on other robots. Future work includes comparison with other machine learning techniques.

Figure 1 .
Figure 1.Generic model of action selection with feedback.

Figure 2 .
Figure 2. Scan of excerpts from social studies on facial expressions.

Figure 3 .
Figure 3. Setup of the room of the experiment: (a) curtain covering the view of the participant, who enters the room and reaches position (b) face to face with the robot (c).

Figure 4 .
Figure 4. Greeting selection model.In blue, input features; in yellow, the discriminant of the mappings; in green, the output.

Figure 5 .
Figure 5. Structure of the algorithm for greeting interaction, with the explicit formulas. Notation:
K: number of samples in the dataset
I = 4: number of input features
i = 1 … I: index of features
k = 1 … K: index of the k-th sample <f, g_j> in the dataset
f: feature input vector
f_i: i-th feature of f
f(k): k-th feature vector in the dataset
f*: feature vector selected by the classifier
f**: feature vector suggested by the participant to match g*
v: value that can be taken by a feature f_i
G: set of greetings (classes)
J = 5: number of possible greeting choices
j = 1 … J: index of greetings
g_j: j-th greeting in G
g(k): greeting at the k-th element in the dataset
g*: greeting chosen by the classifier
g**: greeting chosen by the participant in case g* receives a low score
w_j(f): weight of the greeting j for the feature vector f
m = 2: equivalent sample size of the m-estimate formula
p = 1/J: uniform prior estimation of the probability
s: counter of visits of the current f*
r = {1, 0.5, 0, -0.5, -1}: reward factor, depending on the user's evaluation
l = exp(-s/4): learning factor, high (around 0.8) at the beginning and decreasing following the e^-x curve
S: stopping conditions as in Equations (13) and (14), with W = 10 and q = 2 (θ = 2/σ_TOT)
δ(k) = {1, 0.2, 0}: multiplier for incomplete data, as in Equation (11); set empirically low in case of undefined f_i, due to the high quantity of incomplete training data

Figure 7 .
Figure 7. Boxplots of the average behavior rating.

Figure 8 .
Figure 8. Boxplots of the probability values in the final mapping.


• One feature turned into a discriminant in order to have two separate mappings
• Adaptation of one mapping made from literature to a new one vs. adaptation without a training dataset
• Rewards affecting more than one cell of the mapping through questionnaires
• Training data: from the literature of human studies vs. from corpora vs. from experimental data
• Reward: from questionnaires vs. from a cost function
• Learning: incremental across all participants vs. separate for each participant
• Class selection: using the weights directly vs. leave-one-out cross-validation

Table 1 .
A portion of raw training data.

Table 2 .
A portion of training data, shown as sample.

Table 3 .
Examples of mapping depending on class priors.


Table 4 .
Feature values and classes of the greeting gestures model.

Table 5 .
Count of visited states at T = 30.

Table 9 .
Evolution of maximum likelihoods for one iteration.
