1. Introduction
The number of humans who interact with robots in a social context is increasing. In contrast to industrial robots, the purpose of social robots is not to accurately perform a specific task in a highly constrained environment, but to assist in therapy, education or social services [
1]. Nowadays, social robots are already employed in many different areas to help elderly people [
2,
3], people with dementia [
4], autistic children [
5] as well as people with disabilities [
6]. Previous research showed that social robots that follow human social norms, like empathy, are seen as more supportive [
7] and friendly [
8]. A social norm defines how people should behave, thereby creating specific expectations, which, when violated, lead to specific reactions, including sanctions or punishment [
9,
10]. Thus, enabling robots to follow social norms has the potential to make human–robot interactions more enjoyful, predictable, natural, and, in general, more similar to interactions between humans. However, due to the large number of social norms, their dynamic nature, i.e., they can change over time, and their strong variation based on the environment of the interaction as well as the personality of the human the robot is interacting with, they cannot be hard-coded into the robot but must instead be learned. Learning can either occur through trial-and-error, i.e., reinforcement learning [
11], by observing how humans interact with each other, i.e., learning from demonstration [
12], or by utilizing abstract knowledge available in written form, e.g., on the web or in books. While there have been several studies that investigated learning from demonstration and reinforcement learning to learn social norms [
13,
14,
15,
16], there have not been any studies that investigated learning of social norms from abstract knowledge provided in natural language. Understanding natural language is non-trivial and requires sophisticated language grounding mechanisms that provide meaning to language by linking words and phrases to corresponding concrete representations. A concrete representation is a set of invariant perceptual features, obtained through an agent’s sensors, that suffices to distinguish percepts belonging to different concepts [
17]. Most grounding research has focused on understanding natural language instructions so that robots can identify and manipulate the correct object [
18,
19] or navigate to the correct destination [
20], while, to the best of our knowledge, no attempts have been made to ground more abstract concepts, such as emotion types, emotion intensities and genders, which are essential for understanding natural language texts describing social norms, such as empathy.
In this study, we try to fill this gap by proposing an unsupervised online grounding framework, which uses cross-situational learning to ground words describing emotion types, emotion intensities and genders through their corresponding concrete representations extracted from audio with the help of deep learning. The proposed framework is evaluated through a simulated human–agent interaction experiment in which the agent listens to the speech of different people and simultaneously receives a natural language description of the gender of the observed person as well as the experienced emotion. Furthermore, the proposed framework is compared to a Bayesian grounding framework that has been employed in several previous studies to ground words through a variety of different percepts [
18,
19,
20,
21].
The remainder of this paper is structured as follows: the next section provides some background regarding cross-situational learning. Afterwards,
Section 3 discusses related work in the area of language grounding. The proposed framework, the baseline and the employed experimental setup are explained in
Section 4,
Section 5 and
Section 6.
Section 7 describes the obtained results. Finally,
Section 8 concludes the paper.
3. Related Work
Since Harnad [
17] proposed the “Symbol Grounding Problem”, a variety of models that utilize either unsupervised or interactive learning mechanisms have been proposed to create connections between words and corresponding concrete representations. Interactive learning approaches are based on the assumption that another agent is available that already knows the correct connections between words and concrete representations so that it can support the learning agent by providing feedback and guidance. Due to this support, interactive learning models are usually faster, and often also more accurate, than unsupervised learning models; however, they do not work in the absence of a tutor who provides the required support. Furthermore, in most studies, e.g., [
32,
33], the tutoring agent did not provide real natural language sentences but only single words, which significantly simplifies the grounding problem and raises the question of whether these models would work outside the laboratory, since, in real environments, the tutor would be a regular user who might not be aware of the limitations of the learning agent or might be unwilling to adjust the interaction accordingly. Examples of interactive grounding approaches include the “Naming Game” [
34], which has been used in many studies to ground a variety of percepts, such as colors or prepositions [
32,
33], and the work by She et al. [
35]. The latter used a dialog system to ground higher-level symbols through already grounded lower-level symbols, thereby introducing the additional constraint that a sufficiently large set of already grounded lower-level symbols must be available. It is important to note that most, if not all, existing interactive learning approaches assume that the provided support is always correct, although it might be wrong due to noise or malicious intent of the tutoring agent. In contrast to interactive-learning-based approaches, unsupervised grounding approaches do not require any form of supervision and learn the meaning of words across multiple exposures through cross-situational learning [
36,
37]. The main advantage is that no tutor is needed, which makes them easier to deploy and also removes a potential source of noise, because it cannot be guaranteed that another agent that is able and willing to act as a tutor is present, nor can it be assumed that the received support is always correct. Both points are important when deploying an agent in a dynamic, uncontrolled environment that does not allow any control over the people who interact with the agent. In previous studies, cross-situational learning has been used for the grounding of shapes, colors, actions, and spatial concepts [
18,
20,
21]. However, most proposed models only work offline, i.e., perceptual data and words need to be collected in advance, and the employed scenarios only contained unambiguous words, i.e., no two words were grounded through the same percept. In contrast, the grounding framework used in this study, which is based on the framework proposed in [
38], is able to learn online and in an open-ended manner, i.e., no separate training phase is required, and it is also able to ground synonyms, i.e., words that refer to the same concrete representation in a specific context, e.g., “happy” and “cheerful”.
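The core idea of cross-situational learning can be illustrated with a minimal sketch that accumulates word–percept co-occurrence counts across situations and grounds each word through the percept it co-occurs with most often. This is a simplified illustration, not the framework of [38]; the class, method names, and percept labels are hypothetical.

```python
from collections import defaultdict

class CrossSituationalLearner:
    """Minimal cross-situational learner: accumulates word/percept
    co-occurrence counts and grounds each word through the percept
    it has co-occurred with most often across situations."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, words, percepts):
        # Every word in the sentence co-occurs with every percept in the
        # situation; stable word-percept regularities emerge over time.
        for w in words:
            for p in percepts:
                self.counts[w][p] += 1

    def grounding(self, word):
        # Return the percept most often co-observed with the word,
        # or None if the word has never been encountered.
        if word not in self.counts:
            return None
        return max(self.counts[word], key=self.counts[word].get)

learner = CrossSituationalLearner()
learner.observe(["the", "man", "is", "happy"], ["MALE", "HAPPINESS"])
learner.observe(["the", "woman", "is", "cheerful"], ["FEMALE", "HAPPINESS"])
learner.observe(["the", "man", "is", "sad"], ["MALE", "SADNESS"])
learner.observe(["the", "woman", "is", "happy"], ["FEMALE", "HAPPINESS"])
```

After these four situations, “happy” is most strongly associated with the happiness percept even though it also co-occurred with gender percepts, which is how ambiguity is resolved across exposures.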
5. Baseline Framework
The baseline framework uses the same percept extraction and classification components as the proposed framework (
Section 4.1 and
Section 4.2), but uses a probabilistic model to ground words through their corresponding concrete representations. The latter is described in detail in this section.
The probabilistic learning model is based on the model used in [
19]. The model has been chosen as a baseline because similar models have previously been employed in many different grounding scenarios to ground a variety of percepts, such as shapes, colors, actions, or spatial relations [
18,
19,
20,
21]. In the model (
Figure 2), the observed state w represents word indices, i.e., each individual word is represented by a different integer. The following two example sentences illustrate the representation of words through word indices: (the, 1) (man, 2) (is, 3) (very, 4) (happy, 5) and (the, 1) (woman, 6) (is, 3) (really, 7) (cheerful, 8), where the numbers indicate word indices. Although “very” and “really” as well as “happy” and “cheerful” are synonyms in the context of this study (
Table 2), they are represented by different word indices. The observed state t represents the type of emotion, s represents the strength or intensity of the emotion, and g represents the gender.
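The word-index representation above can be sketched as a simple first-appearance indexing scheme; this is an illustrative reconstruction of the mapping shown in the example sentences, not the paper's actual implementation.

```python
def build_word_indices(sentences):
    """Assign each distinct word a unique integer index in order of
    first appearance; synonyms such as "happy"/"cheerful" receive
    different indices because indexing is purely lexical."""
    indices = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in indices:
                indices[word] = len(indices) + 1
    return indices

indices = build_word_indices([
    "the man is very happy",
    "the woman is really cheerful",
])
# the=1, man=2, is=3, very=4, happy=5, woman=6, really=7, cheerful=8
```

Note that repeated words (“the”, “is”) keep their original index, reproducing the indexing of the two example sentences.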
Table 3 provides a summary of the definitions of the learning model parameters. The corresponding probability distributions over w, t, s, and g, which characterize the different modalities in the graphical model, are defined in Equation (
1), where
GIW denotes a Gaussian Inverse-Wishart distribution, and
N denotes a multivariate Gaussian distribution. Gaussian distributions are used for t, s, and g because their concrete representations are represented by one-hot encoded vectors.
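As a sketch of how such modality distributions are typically parameterized with a Gaussian Inverse-Wishart prior, the following form is consistent with the description above; the symbols and hyperparameters are illustrative and not necessarily the paper's exact notation in Equation (1).

```latex
% Illustrative parameterization for one modality m \in \{t, s, g\}:
% a GIW prior over the Gaussian parameters, and a Gaussian likelihood
% over the one-hot concrete representations x_m.
(\mu_m, \Sigma_m) \sim \mathrm{GIW}(\mu_0, \kappa_0, \Psi_0, \nu_0), \qquad
x_m \mid \mu_m, \Sigma_m \sim \mathcal{N}(\mu_m, \Sigma_m)
```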
The latent variables of the Bayesian learning model are inferred using the Gibbs sampling algorithm [
44] (Algorithm 3), which repeatedly samples from and updates the posterior distributions (Equation (
2)). Distributions were sampled for 100 iterations, after which convergence was achieved.
Algorithm 3 Inference of the model’s latent variables. The number of iterations was set to 100.
1: procedure GibbsSampling(W, P, WP, AW)
2:     Initialize the latent variables
3:     for iteration = 1 to 100 do
4:         Sample from and update the posterior distributions (Equation (2))
5:     end for
6:     return the inferred latent variables
7: end procedure
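The sampling loop of Algorithm 3 can be sketched generically: each latent variable is repeatedly resampled from its conditional distribution given the current values of all the others. The following toy example illustrates only the loop structure; the two-variable Gaussian model and all names are hypothetical, not the baseline's actual conditionals.

```python
import random

random.seed(0)  # for reproducibility of the toy example

def gibbs_sample(init, conditionals, n_iter=100):
    """Generic Gibbs sampler: repeatedly resample each latent variable
    from its conditional distribution given the current values of all
    other variables, mirroring the loop structure of Algorithm 3."""
    state = dict(init)
    for _ in range(n_iter):
        for name, sample_fn in conditionals.items():
            state[name] = sample_fn(state)
    return state

# Toy model with two coupled Gaussian latents: x | y ~ N(y/2, 1) and
# y | x ~ N(x/2, 1); the chain forgets its initialization and settles
# near the origin.
conditionals = {
    "x": lambda s: random.gauss(0.5 * s["y"], 1.0),
    "y": lambda s: random.gauss(0.5 * s["x"], 1.0),
}
result = gibbs_sample({"x": 5.0, "y": -5.0}, conditionals)
```

In the baseline model, the state would instead hold the latent assignments and GIW/Gaussian parameters, and each conditional would be the posterior in Equation (2).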
6. Experimental Setup
The proposed framework (
Section 4) is evaluated through human–agent interactions in simulated situations created using the RAVDESS dataset [
40], which consists of frontal face pose videos of twelve female and twelve male North American actors and actresses, who speak and sing two lexically matched sentences while expressing six basic emotions, i.e., happiness, surprise, fear, disgust, sadness, and anger [
45], plus calmness and neutral, through their voice and facial expressions. In this study, only the speaking records of the six basic emotions and neutral are used. In addition to the expressed emotions, all videos in which one of the six basic emotions is expressed also come with labels indicating the intensities of the expressed emotions, i.e., normal or strong, which are used in this study to train the emotion intensity recognition model (
Section 4.2). Since the videos of eighteen actors and actresses are used to train the percept classifiers (
Section 4.2), only the videos of six actors and actresses are used to create situations for the simulated human–agent interactions, leading to a total of 312 situations, i.e., for each person, eight videos per basic emotion (four for each intensity level) and four videos for neutral (only one intensity level). Each situation is created according to the following procedure:
The video representing the current situation is given to OpenEAR, which extracts 384 features (
Section 4.1);
156 (MFCC and PCM RMS) of the 384 features are provided as input to the employed deep neural networks to determine the concrete representations of the expressed emotion, its intensity and the gender of the person expressing it (
Section 4.2);
The concrete representations are provided to the agent together with a sentence describing the emotion type, intensity and gender of the person in the video, e.g., “She is very angry.”;
The agent uses cross-situational learning to ground words through corresponding concrete representations (
Section 4.3 and
Section 5).
Each sentence has the following structure: “(the)
gender is (
emotion intensity)
emotion type”, where
gender,
emotion intensity and
emotion type are replaced by one of their corresponding synonyms (
Table 2). If the emotion type is “neutral”, the intensity is always normal; thus, the sentence does not contain a word describing the intensity of the emotion and no corresponding concrete representation is provided to the agent. Additionally, if the gender is described by a noun, i.e., “woman” or “man”, it is preceded by the article “the”. Since the words used to describe a situation are randomly chosen from the available synonyms, how often each word occurs during training and testing varies, e.g., “cheerful” appears nearly twice as often during training and testing as its synonym “happy” (
Figure 3). Ten different interaction sequences, for which the order of the situations was randomly changed, are used to evaluate the grounding frameworks to ensure that the obtained results are independent of the specific order in which situations are encountered. The proposed framework receives situations one after the other as if it were processing the data in real time during the interaction, while the baseline framework requires all sentences and corresponding concrete representations of the training situations to be provided at the same time. Therefore, two different cases are evaluated. First, all situations are used for training and testing; this allows the proposed framework to learn continuously, but it is an unrealistic case for the baseline framework because it is very unlikely that all situations would already have been encountered during training. Second, only 60% of the situations are used for training, which is more realistic for the baseline framework, while it imposes an unnecessary limitation on the proposed framework, which does not require an explicit training phase, by deactivating its learning mechanism for 40% of the situations.
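The sentence-generation scheme described above can be sketched as follows; the synonym sets are illustrative guesses (the actual sets are given in Table 2, which is not reproduced here), and only the template structure is taken from the text.

```python
import random

# Illustrative synonym sets; the actual synonyms are listed in Table 2,
# so the entries below are assumptions.
SYNONYMS = {
    "gender": {"female": ["she", "the woman"], "male": ["he", "the man"]},
    "intensity": {"normal": ["a bit"], "strong": ["very", "really"]},
    "emotion": {
        "happiness": ["happy", "cheerful"],
        "anger": ["angry", "mad"],
        "neutral": ["fine"],
    },
}

def describe(gender, emotion, intensity):
    """Generate a description of the form
    '(the) <gender> is (<intensity>) <emotion>'; for "neutral" the
    intensity word is omitted, as described in the text."""
    words = [random.choice(SYNONYMS["gender"][gender]), "is"]
    if emotion != "neutral":
        words.append(random.choice(SYNONYMS["intensity"][intensity]))
    words.append(random.choice(SYNONYMS["emotion"][emotion]))
    return " ".join(words).capitalize() + "."
```

For example, `describe("female", "anger", "strong")` can produce “She is very angry.”, matching the example sentence in the procedure above; the random choice among synonyms is what causes the uneven word frequencies shown in Figure 3.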
7. Results and Discussion
The proposed cross-situational learning based framework (
Section 4) is evaluated through a simulated human–agent interaction scenario (
Section 6) and the obtained grounding results are compared to the groundings achieved by an unsupervised Bayesian grounding framework (
Section 5). Since the same percept extraction and classification components (
Section 4.1 and
Section 4.2) are used for both frameworks, any difference in grounding performance can only be due to the different grounding algorithms described in
Section 4.3 and
Section 5.
Figure 4 shows how the mean number of correct and false mappings obtained by the proposed framework changes over all 312 situations. It shows two different cases, which differ regarding the concrete representations used for emotion types: for the first case (TPRE), the predicted concrete representations are used, while for the second case (TPER), perfect concrete representations are used to investigate the effect of the accuracy of the concrete representations on the grounding performance. For TPRE, represented by continuous lines, the number of correct mappings quickly increases from zero to about twelve mappings over the first 20 situations, and then continues to increase more slowly to 15 mappings, while the number of false mappings starts at about six mappings and increases over the course of 45 situations to 15 mappings, after which it slowly decreases to 13 mappings. The main reason for the large number of false mappings is that the concrete representations used for emotion types are highly inaccurate, with an accuracy of 59.6%, while, at the same time, 60% of the employed words refer to them. This assumption is confirmed by TPER, represented by the dashed lines, which shows the number of correct and false mappings when perfect concrete representations are used for emotion types, while the predicted ones are still used for the other two modalities, i.e., emotion intensity and gender. For TPER, the proposed framework obtains 17 and 20 correct mappings within the first 20 and 45 situations, respectively. If the framework is only allowed to learn during 60% of the situations, it obtains 21 correct mappings, while it obtains one more mapping, i.e., 22, if it continues learning for the remaining situations.
In contrast, the number of false mappings increases slightly from five to seven from the first to the second situation, stays stable for about eight situations, and then decreases continuously to two mappings after 60% of the situations have been encountered and to one mapping after all situations have been encountered. Together, both cases illustrate that the proposed grounding algorithm depends on the accuracy of the obtained concrete representations; however, it does not require perfectly accurate representations, because it obtains all correct mappings in the second case, although the concrete representations for emotion intensities and genders only have accuracies of 73.5% and 89.8%, respectively.
The figure also illustrates the online grounding capability and transparency of the proposed grounding algorithm because it updates its mappings for every newly encountered situation and makes it possible to check, at any time, through which concrete representation a word is grounded. The latter becomes important when the model is used in real human–agent interactions to understand and debug the agent’s actions, especially in cases where they might have been inappropriate. Since the baseline model requires an explicit training phase, no similar figure can be obtained. Thus, to compare the two models, the mappings of the proposed model are extracted after 187 or 312 situations, depending on the train/test split used. In this study, two different train/test splits are used. For the first split (TTS60), only 60% of the situations are used for training and the remaining 40% for testing to investigate how well the models perform for unseen situations.
Figure 3 provides an overview of the average occurrence of each word in the train and test sets. Applying the learning mechanisms of the proposed model only during the first 187 situations is both unnecessary and unrealistic because the model is able to learn in an online manner and does not require an explicit training phase; however, this was done for fairness to the baseline model, which also sees only 60% of the situations during training. In contrast, for the second split (TTS100), all situations are used for training and testing to ensure that the proposed model can learn continuously, while providing an unrealistic benefit to the baseline model because it is very unlikely that it would already have encountered all situations during the offline training phase.
Figure 5 shows the accuracies for each modality for both models and test splits, as well as the percentage of sentences for which all words were correctly grounded. Additionally, it also illustrates the influence of the accuracy of the employed concrete representations by showing both the results for the predicted concrete representations of emotion types (
Figure 5a) and when using perfect concrete representations for emotion types (
Figure 5b). The proposed model achieves a higher accuracy than the baseline model in all cases, i.e., for all modalities, train/test splits and both concrete representations of emotion types, except for emotion types, when the predicted concrete representations are used and all situations are encountered during training. In fact, for genders, the proposed model achieves perfect grounding due to the high accuracy of the corresponding concrete representations, i.e., 89.8%. The figure also confirms the results in
Figure 4 that the grounding accuracy improves with the number of encountered situations, which seems intuitive but is not necessarily the case, as shown by the results obtained for the baseline model, i.e., the latter obtained less accurate groundings for most modalities when using all situations for training and testing due to the larger number of situations in the test set. For the baseline model, using perfect concrete representations for emotion types increases the accuracy of the groundings obtained for emotion types and genders as well as the accuracy of auxiliary word detection, although the accuracy of the latter two only increases for TTS100, while the accuracy of the emotion intensity groundings decreases independently of the number of situations encountered during training.
Although the accuracies provide a good overview of the grounding performance for each modality, they do not provide any details about the wrong groundings or about the accuracy of the groundings obtained for individual words. Therefore,
Figure 6 shows the confusion matrices for all words and modalities, which illustrate how often each word was grounded through the different modalities and highlight two interesting points. First, both models show a high confusion for emotion types, i.e., all emotion type words have non-zero probabilities of being mapped to concrete representations of emotion intensities or genders, due to the low accuracy of the corresponding concrete representations for TTS60. The confusion decreases for TTS100, in which case most words converge to one modality for the proposed model, i.e., only “happy” and “sad” are still confused as a gender or an emotion intensity, respectively. However, this does not lead to a substantial increase in grounding accuracy for emotion types because some words, e.g., “surprise” and “afraid”, converge to the wrong modality, so that their probability of being mapped to a concrete representation of an emotion type decreases to zero.
Figure 7 shows confusion matrices of words and different concrete representations, thereby making it possible to investigate whether the concrete representation a word is grounded through is correct, which might not be the case if there is a high confusion between concrete representations of the same modality. The fourth column, representing the emotion type neutral, is very noticeable in
Figure 7a,c,d because neither model maps any word to it, except for the proposed model with TTS60 (
Figure 7b); however, even in the latter case, the probability that the word “fine” is mapped to it is very low because most of the time it is mapped to the concrete representation of the concept male (column 11). Otherwise, the results show that, for the proposed model, the confusion is normally across modalities and not between concrete representations of the same modality. In contrast, the baseline model shows strong confusion between concrete representations of the same modality, e.g., for TTS60, “happy” and “disgusted” are more often grounded through anger than through happiness and disgust, respectively.
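A word-by-modality confusion table like the one underlying Figure 6 can be tabulated as follows; the function and the example data are illustrative, not the paper's actual evaluation code.

```python
from collections import Counter

def modality_confusion(groundings, true_modality):
    """Count, for each word, how often it was grounded through each
    modality across runs, and report the overall fraction of groundings
    that match the word's true modality."""
    table = {}
    correct = total = 0
    for word, observed in groundings.items():
        counts = Counter(observed)
        table[word] = dict(counts)
        correct += counts[true_modality[word]]
        total += len(observed)
    return table, correct / total

# Hypothetical groundings observed over three runs:
table, accuracy = modality_confusion(
    {"happy": ["emotion_type", "emotion_type", "gender"],
     "very": ["emotion_intensity"] * 3},
    {"happy": "emotion_type", "very": "emotion_intensity"},
)
```

Here “happy” is confused with a gender once, so the overall modality accuracy is 5/6, while the per-word rows of the table correspond to one row of a confusion matrix.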
When considering the deployability of the proposed and the baseline framework, it is important to also analyse the required computational resources. The grounding experiments were conducted on a system with Ubuntu 16.04, an Intel i7-6920HQ octa-core CPU at 2.90 GHz, and 32 GB RAM. However, it is important to note that both frameworks utilize only a single core; thus, the same processing times would be achieved on a single-core system, if no other computationally expensive processes are running at the same time. The average time it took the proposed framework to process a new situation and update its mappings was 3 ms, while the inference time was only 56 μs. In contrast, one Gibbs sampling iteration of the baseline model took 647 ms. Since 100 iterations were used, the average training time (averaged across all 10 runs) for the baseline model was 65 s for all 312 situations, while the inference time was on average 7.45 ms for each situation. These results confirm that both frameworks can be used for real-time grounding applications, while only the proposed framework can be used for dynamic environments that require frequent model updates, because the baseline framework already requires more than one minute to train on a relatively small number of situations, and training needs to be done in advance and is, therefore, not possible after deployment.
Overall, the evaluation shows that the proposed model outperforms the baseline in terms of auxiliary word detection and grounding accuracy as well as in its ability to learn continuously without requiring explicit training. The latter not only makes it more applicable to real-world scenarios but also more transparent, because it is possible to observe how a new situation influences the obtained groundings.
8. Conclusions and Future Work
This paper investigated whether the proposed unsupervised online grounding framework is able to ground abstract concepts, like emotion types, emotion intensities and genders, during simulated human–agent interactions. Percepts were converted to concrete representations through deep neural networks that received as input audio features extracted via OpenEAR from videos.
The results showed that the framework is able to identify auxiliary words and ground non-auxiliary words, including synonyms, through their corresponding emotion types, emotion intensities and genders. Additionally, the proposed framework outperformed the baseline model in terms of the accuracy of the obtained groundings, as well as its ability to learn new groundings and continuously update existing groundings during interactions with other agents and the environment, which is essential when considering real-world deployment. Furthermore, the framework is also more transparent, due to the creation of explicit mappings from words to concrete representations.
In future work, we will investigate whether the framework can be used to ground homonyms, i.e., single words that refer to several different concrete representations. Furthermore, we will investigate whether the framework can ground emotion types, intensities and genders if multiple people are present in a video. Finally, we are planning to integrate the framework with a knowledge representation to explore the utilization of abstract knowledge to increase the sample-efficiency of the grounding mechanism as well as the accuracy of the obtained groundings, and to enable agents to reason about the world with the help of an abstract but grounded world model.