Expressing Robot Personality through Talking Body Language

Social robots must master the nuances of human communication as a means to convey an effective message and generate trust. It is well known that non-verbal cues are very important in human interactions, and therefore a social robot should produce body language coherent with its discourse. In this work, we report on a system that endows a humanoid robot with the ability to adapt its body language according to the sentiment of its speech. A combination of talking beat gestures with emotional cues such as eye lighting, body posture, voice intonation and volume permits a rich variety of behaviors. The developed approach is not purely reactive, and it easily allows a kind of personality to be assigned to the robot. We present several videos of the robot in two different scenarios, showing discrete and histrionic personalities.


Introduction
Human-robot interaction (HRI) is the field of study dedicated to understanding, designing and evaluating robotic systems to be used by or with humans [1,2]. HRI is multidisciplinary, with contributions from areas such as human-computer interaction, artificial intelligence, robotics, natural language understanding and the social sciences, among others.
Social robots have emerged as a class of robots that require a highly evolved type of human-robot interaction. These robots cannot be merely teleoperated and must possess skills beyond those present in cooperative robots, due to the challenges faced when developing social intelligence; robots that interact with humans should behave as we humans do.
Socially interacting robots must have abilities such as communicating using verbal (natural language) or non-verbal modalities (lights, movements or sound); expressing affection or perceiving human emotions; possessing a distinctive personality; modelling human social aspects; learning; and establishing social relationships [3,4]. Robots able to interact in such ways are being envisioned for many applications, such as caregiving for the elderly or for people with physical or emotional disabilities, education, entertainment, and even domestic scenarios [5][6][7][8][9][10][11].
Verbal and body expression in robots are thus a main concern when developing social interaction. This paper aims to link these two aspects of social intelligence by creating a system that coordinates the robot's body language with the nature of its discourse. By "nature" we mean the emotional aspect of the speech, as extracted by a text sentiment analyzer. The numerical output of the sentiment analyzer is then reflected to some degree in the movement or change of state of different parts of the robot's body.
Therefore, the contribution of this paper is twofold: (1) The development of a talking behavior as a combination of talking beat gestures with emotional cues such as eye lighting, body posture, voice intonation and volume. At this stage, the sentiment extracted from the text being vocalized by the robot is used as input to the talking behavior, and each feature is affected by the polarity given by the sentiment analyzer. (2) Instead of being purely reactive, the developed approach easily affords modulating the intensity or type of the actions that accompany the speech, depending on the personality we would like to assign to the robot.
The rest of the paper is structured as follows. Section 2 reviews the literature on emotional and affective robotics. Section 3 describes the approach: how beat gestures are generated using a deep generative model, how different features are used to convey the sentiment extracted by a text sentiment analyzer, and how the robot state is affected by each reaction. Robot personality adjustment is described in Section 4, and the resulting robot behavior is presented and discussed by means of videos in Section 5. Finally, Section 6 outlines the improvements proposed as further work.

Emotion Expression in Robots
Citing [12], "Creating and sustaining closed-loop dynamic and social interactions with humans requires robots to continually adapt towards their users' behaviors, their affective states and moods while keeping them engaged in the task they are performing". In this vein, the Affective Loop is defined as the interactive process in which the user of the system first expresses their emotions through some physical interaction involving their body, and the system responds by generating affective expression, which in turn affects the user, making the user respond and, step by step, feel more and more involved with the system [13][14][15][16]. Thus, perceiving and showing emotions is essential to sustain interaction.
Verbal communication is a natural way of interaction among humans. It is one of the communication channels most used by human beings; it is dynamic and is learned from childhood. Even when we do not know how to write or read, we are able to communicate with others through words. Oral language allows us to transmit a message to the receiver, whether it be an opinion, an order, a feeling, etc. We express these through the articulation of a sound or group of sounds with different types of intonation, which can give a greater or lesser emotional charge to what is expressed. The way in which the human voice can be modulated plays an important role in the communication of emotions. This process can be very complex and is of the utmost importance in human-robot interaction, according to Crumpton and Bethel [17], who highlight the importance of vocal prosody.
However, non-verbal expression is key to understanding sociability [18,19]. Some authors working with virtual agents and computer graphics have obtained impressively realistic animations of human characters. They are able to accurately synchronize the gesture behavior with the synthesized speech [20][21][22][23]. The demand for robots to behave in a sophisticated manner requires the implementation of capabilities similar to those typical of humans: sensing, processing, action and interaction. All of them have to take into account the underlying cognitive functions: motivations, emotions and intentions. Recently, effort has been devoted to the search for behaviors that are able to convey sentiment. As the main mechanism to communicate emotions, facial expression plays a predominant role [24,25].
The robotic head Kismet [26] is itself a milestone in showing how the human voice and facial features affect expressiveness. Furhat is another robotic head that shows similar characteristics, but it uses a back-projected facial animation system for face-to-face interaction [27].
The advent of humanoid robots has encouraged researchers to investigate and develop body language expression in robots. Body language uses the gestures, postures and movements of the body and face to convey information about the emotions and thoughts of the sender, and can disclose as much information as words. For instance, Anki's Cozmo [28] is a tiny robot with impressive body expression [29]. Facial emotion is shown on a small LCD screen. It moans and laughs. A kind of shovel that it uses for manipulation purposes adds arm-level expression to a wheeled platform. At a different scale, Shimi [30], a smartphone-enabled robotic musical companion far from human morphology, expresses emotion rather differently, using a (faceless) body with a notably small number of DoFs.
Humans can also learn to associate colors with emotions, and color could therefore be used as another channel of communication in conjunction with adequate cognitive models [31][32][33]. Color and light patterns can be modulated in a dynamic way to evoke happiness, sadness, anger or fear [34,35].
Different cultures, or even individuals, may interpret non-verbal behavior differently, but it is highly relevant for social interaction in all of them. When engaged in verbal communication with a robot, a person's trust is higher when the robot's gaze is directed at them [36]. In [37], the authors propose a system in which the robot expresses itself through gestures in addition to speech, and in which the robot takes into account the human's reactions to adapt its own behavior. They then assess how people perceive this behavior compared with the speech-only behavior.
In [38] the same authors add facial expressions to their system. They report on an experiment where participants discuss with the robot videos chosen to induce some particular emotion, and the robot tries to adjust its behavior to the emotional content of that video.
Huang and Mutlu [39] conclude that all types of gestures affect user perceptions of the robot's performance as a narrator. Therefore, an important goal is to create a coherent gesture-speech narrative. In [40], a system is trained on single-speaker audio and motion-capture data, and it is able to generate both speech and full-body gestures from that input. A framework for speech-driven gesture production, designed to be applied to virtual agents with the aim of enhancing human-computer interaction, is presented in [41].
The main contribution of this paper is the novel combination of all of the previous aspects in a socially interacting robot. In the following sections we introduce our approach to address the goal of generating an adaptive expression behavior for social robots.

Sentiment to Expression Conversion
As mentioned before, human talking expression can be affected by many factors. The mood, the interactional cues perceived, or the character and content of what is being said are reflected in both facial and body features. As we are concerned with the development of a natural talking robot behavior, at this stage we opted for analysing the effect of the nature of the verbosity, i.e., the text being pronounced by the robot. Thus, this nature of the verbosity, as a measure of the sentiment, constitutes the affective input of the robot's emotion system and the measure used to modulate the robot's gesturing features.

Affective Input
The literature reveals two types of emotional models. On the one hand, the theory of basic emotions divides emotions into discrete and independent categories, such as the six basic emotions (anger, disgust, fear, happiness, sadness and surprise) identified by Ekman [42].
Other affective models regard those experiences as a continuum of highly ambiguous states that are also highly interrelated. They describe emotions as linear combinations of Valence-Arousal-Dominance (VAD) values. Valence is a measure of the positivity-negativity of the stimulus, Arousal is related to the energy level, and Dominance addresses the approachability of the stimulus.
These models allow for a wider range of emotions [43,44]. The goal of the research field called sentiment analysis is the analysis, from pieces of written language, of the writer's attitudes, evaluations, emotions, sentiments, and opinions [45]. Its main purpose is to associate a given text with its polarity: positive, negative or neutral.
A brief review of the sentiment analyzers tested in this work is given below. The VADER (Valence Aware Dictionary for sEntiment Reasoning) sentiment analyzer [46] is a tool for sentiment analysis first designed for social media, but also applicable in other domains. It is based on rules derived from dimensional affective models, but also uses a lexicon. It analyzes the intensity and polarity of the written text and gives as output the proportion of text for each category and a compound score computed from the Valence scores of each word, according to the lexicon.
VADER is optimized for social media data. The well-known NLTK library [47] also uses VADER as a sentiment analysis tool.
TextBlob https://textblob.readthedocs.io/en/dev/ (accessed on 18 May 2021) is another tool for text analysis, written in Python. Its API provides sentiment analysis and other common natural language processing (NLP) capabilities, such as noun phrase extraction or part-of-speech tagging.
Similar to VADER, it is a rule-based system but slightly more limited in the sense that it does not take into account the punctuation marks. The sentiment analyzer outputs the polarity of the sentence together with the level of subjectivity. The advantage is that it can evaluate pieces of text instead of individual sentences.
The main weakness of the rule-based approaches is that context is not taken into account, and only single words are analyzed. To address this issue, word representations in the form of embeddings place two words with similar meaning close to each other in an n-dimensional space. This approach is used in the flair framework [48]. Table 1 shows a comparison example of the sentiment scores obtained by the three mentioned tools. The sentences of the table correspond to the text chosen for the experimentation in Section 5. None of the tools takes into account the semantics of the text as a whole, nor relates each sentence to the previous one.
All three options gave unexpected outputs, i.e., scores that would be very different if the text were manually annotated. The decision to use VADER for our experiments was based, on the one hand, on observation and, on the other hand, on the flexibility to attenuate or enhance the nuance given to a sentence by the analyzer by adding emoticons or punctuation marks. This allows us to obtain a robot expression more in harmony with the semantics of the text that the analyzer does not capture.
The Valence value places the emotion on a scale ranging from sadness to happiness, thus evaluating its positivity. VADER takes the Valence scores of all the words, sums them, and returns a normalized sentiment in the range (−1, +1), where −1 is the negative extreme and +1 the positive extreme. We then translate this VADER score from sentiment to emotion in the following manner: (Positive: VADER score ≥ 0.5, Neutral: −0.5 < VADER score < 0.5, Negative: VADER score ≤ −0.5). In this work, the emotion value is obtained by a direct translation of the value returned by VADER (sentiment) into the emotion value in the sadness-happiness continuum.
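The thresholding rule above can be written in a few lines of Python. This is a minimal sketch: the function name `sentiment_to_emotion` and the label strings are illustrative assumptions, and only the ±0.5 cut-offs come from the text.

```python
def sentiment_to_emotion(compound: float) -> str:
    """Map a compound score in [-1, 1] to a coarse emotion label.

    The function name and labels are hypothetical; the thresholds
    follow the rule stated in the text.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= 0.5:
        return "positive"
    if compound <= -0.5:
        return "negative"
    return "neutral"
```

For example, a score of 0.62 would be labelled positive, whereas 0.49 would still count as neutral.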
As mentioned before, the sentiment analyzer evaluates each sentence individually. VADER was specifically attuned to sentiments expressed in social media and thereby to short sentences. As a consequence, it is not very sensitive, tends to give exaggerated values, and rarely returns a non-zero value within the neutral range (−0.5, 0.5). In an attempt to keep track of the sentiment of the text being verbalized, instead of using the raw compound score we low-pass filter the obtained value, which smooths the compound score over time.
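The text does not specify which low-pass filter is used. As one possible reading, the sketch below uses a first-order exponential filter; the function name, the smoothing factor `alpha` and the initial state of 0 (neutral) are assumptions made for illustration.

```python
def low_pass(scores, alpha=0.3, initial=0.0):
    """Smooth a sequence of per-sentence compound scores.

    First-order exponential filter (an assumed filter type):
        state = alpha * score + (1 - alpha) * state
    `alpha` in (0, 1] is a hypothetical smoothing factor.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    state = initial
    for score in scores:
        state = alpha * score + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed
```

With `alpha = 1` the filter returns the raw scores unchanged; smaller values make the tracked sentiment move more slowly from one sentence to the next.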

Gesticulation Behavior
Kinesic communication is the technical term for body language, i.e., communicating by body movement. The word kinesics comes from the root word kinesis, which means "movement", and refers to the study of hand, arm, body, and head movements. Specifically, this section will outline the use of particular gesticulation; we will focus on the generation of beat gestures, i.e., conversational movements of body parts synchronised with the flow of speech but not associated with particular meaning [49]. This group of gestures mainly affects the upper body.
Our approach to talking beat generation consists of the automatic creation of movements that include arm, hand and head joint positions along a timeline. Those gestures are generated by a Generative Adversarial Network (GAN) [50] model trained with data captured using a motion capture (MoCap) system that employs the Intel RealSense RGB-D camera and OpenPose [51] as skeleton tracker. The use of MoCap to collect data from human speech allows us to capture the naturalness with which we gesticulate when talking and then transfer such properties to the robot. In this way, the gesture generation system allows the robot to generate novel sequences of gestures containing head yaw and pitch positions and arm joint information. A more detailed description of the gesture generation system can be found in [52].
For each sentence to be vocalized by the robot, its duration is calculated and the GAN system is asked to generate motion (each movement consists of four poses) for that duration. Figure 1 summarizes the process followed to generate movements.

Affective Modulation
To appropriately express the emotion obtained from the text analyzer, the sentiment must be mapped into natural body gestures, enriched with facial expressions and voice intonation. This process is performed by the developed "Adaptive Expression" system, which is composed of three main modules: the "Gesture Generator", the "Eyes Lighting Controller" and the "Speech Synthesiser". These modules make the robot adapt its behavior according to the values obtained from the "Emotion Appraisal" step. Each of the expression features can vary, according to the previous output, within an operational interval experimentally defined for each feature. Figure 2 summarizes the main components of the system architecture. A more detailed explanation of how this mapping process occurs is given in the following subsections.

Body Motion
The "Gesture Generator" module is in charge of the generation of talking gestures, converting the emotion value returned by the "Emotion Selector" module into a collection of gestures. These gestures are generated in such a manner that they are well synchronized with the speech when executed.
The speed of gesture execution is adapted to the intended emotion, in order to convey it better. If the emotion is understood as "positive", the gesture is executed in a livelier manner than when the emotion is depicted as "negative".
The emotion also affects the nod of the robot's head. When a neutral emotion is portrayed, the robot's head simply looks forward. However, the robot tilts its head in other situations: for positive emotions the head is directed upwards, while for negative emotions it goes downwards. To obtain valid tilt angles, the obtained Valence value is normalized to the range between the head's minimum and maximum tilt angles. Figure 3 shows an example of GAN-generated gestures for three types of emotions: sadness (negative), neutral and happiness (positive). Finally, the waist is biased to bend down or to stick the chest out, according to the emotion. Figure 4 shows the effect of backbone inclination adjusted by the waist joint.
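The normalization of the Valence value to the head tilt range can be sketched as a linear interpolation. The function name, the angle limits in the example and the sign convention (positive angles meaning "up") are hypothetical; the real limits and sign depend on the robot's joint specification.

```python
def valence_to_tilt(valence: float, tilt_min: float, tilt_max: float) -> float:
    """Linearly map a Valence value in [-1, 1] onto [tilt_min, tilt_max].

    tilt_min is reached at valence -1 (head down, by assumption) and
    tilt_max at valence +1 (head up, by assumption).
    """
    v = max(-1.0, min(1.0, valence))  # clamp out-of-range inputs
    return tilt_min + (v + 1.0) / 2.0 * (tilt_max - tilt_min)


# Illustrative limits in degrees (assumed, not taken from the robot's datasheet).
neutral_tilt = valence_to_tilt(0.0, -20.0, 20.0)
```

With symmetric limits such as these, a neutral Valence maps to a forward-looking head; with asymmetric limits, an offset or a piecewise mapping would be needed to keep the neutral pose at zero.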

Facial Expression
In the field of social robotics, researchers working on the design of robotic eyes have been inspired by the human face and, therefore, have tried to capture the movements and shape of human eyes. However, SoftBank's platform is limited in that sense due to the design of its eyes. Pepper's eyes consist of two LED rings, which can be programmed in different manners. The color intensity can be changed, there is a choice of different color hues, and the LEDs can also be turned on or off for a given time span.
Moreover, taking inspiration from facial expressions displayed in cartoons, the LEDs of each eye are grouped in three different patterns, as shown in Figure 5. As can be seen, only half of the eye is used to show the different emotions. In the experiments performed in [53], this pattern proved to be socially better understood by the public than coloring the whole ring of each eye with the same color.
The conversion from emotion into facial expression is performed by the "Eyes Lighting Controller" module, which changes the color and intensity of the LEDs in the robot eyes. This module codifies the Valence value into a color in the form of a set of RGB values suitable to be displayed in the LEDs.
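The text does not state which colors are associated with each end of the Valence scale. Purely as an illustration, the sketch below interpolates from a neutral white towards blue for negative values and towards yellow for positive values; the three anchor colors, the constant names and the function names are all assumptions.

```python
# Hypothetical anchor colors as (R, G, B) tuples with components in 0..255.
SAD_RGB = (0, 0, 255)          # assumed color for valence = -1
NEUTRAL_RGB = (255, 255, 255)  # assumed color for valence = 0
HAPPY_RGB = (255, 255, 0)      # assumed color for valence = +1


def lerp_color(start, end, t):
    """Blend two RGB tuples; t = 0 gives `start`, t = 1 gives `end`."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def valence_to_rgb(valence):
    """Encode a Valence value in [-1, 1] as an RGB triple."""
    v = max(-1.0, min(1.0, valence))  # clamp out-of-range inputs
    if v >= 0:
        return lerp_color(NEUTRAL_RGB, HAPPY_RGB, v)
    return lerp_color(NEUTRAL_RGB, SAD_RGB, -v)
```

The resulting triple would still have to be converted to whatever format the LED interface of the robot expects.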

Voice Intonation
People modulate the intonation of their voices according to the context and also to add emphasis to the speech. Intonation is also correlated with the speaker's mood. In [54] the authors argue that the role of voice intonation in emotion expression is very important and show that when expressing anger, happiness or fear, the speech is uttered at a higher rate and pitch than when sadness or similar emotions are expressed.
We have associated the three emotions that our system implements with intonations portraying positive, neutral and negative moods. One of the limitations of Pepper's speech synthesizer is that it does not provide a way to directly manage voice intonation. However, it is possible to adjust some voice parameters, namely volume, speed and pitch. In our approach, we change those parameters according to the Valence value of the emotion.

Adaptive Personality
The previous steps produce merely reactive expressions, i.e., each sentence's sentiment level starts from the level at which the previous sentence ended.
This corresponds to a human-like manner of expressing text connotations, in the sense that each action or perception has an impact on our mood or emotional state, which in turn determines the intensity with which the next action is performed. However, the same text is always narrated with the same voice intonation and body gesturing, except for the arm and head yaw motion generated by the GAN.
A straightforward modification allows the robot to show a constrained or exaggerated behavior, depending on the personality we want to assign to the robot. By applying a sigmoid function of the compound score to each of the expression features and rescaling each output to the operational interval of that feature, different levels of expression can be achieved just by adjusting the gain of the exponential. Figure 6 shows plots of the waist pitch angle and the reproduction speed of the generated arm movements for several values of the exponential gain (K). The plots were recorded for the compound score given by VADER over "The Color Monster" tale.
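The exact sigmoid is not given in the text. A minimal sketch, assuming the standard logistic function with gain K applied to the compound score and then rescaled to a feature's operational interval, could look as follows; the function name and the interval used in the example are hypothetical.

```python
import math


def expressive_value(compound: float, gain: float, lo: float, hi: float) -> float:
    """Map a compound score to a feature value inside [lo, hi].

    Assumed form: logistic(gain * compound) rescaled to the interval.
    A larger gain pushes outputs towards the interval limits
    (more exaggerated); a smaller gain keeps them near the midpoint.
    """
    squashed = 1.0 / (1.0 + math.exp(-gain * compound))  # in (0, 1)
    return lo + squashed * (hi - lo)


# Same compound score, two gains (K = 1.5 and K = 4.0 as used in the videos),
# on an illustrative unit interval.
restrained = expressive_value(0.8, 1.5, 0.0, 1.0)
exaggerated = expressive_value(0.8, 4.0, 0.0, 1.0)
```

Under this assumed form, a compound score of zero always maps to the midpoint of the interval regardless of K, so the gain only changes how strongly non-neutral sentences are expressed.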

Results
In order to show the obtained robot performance, two different scenarios have been defined to emphasize the different aspects of the modulated talking behavior.

Scenario 1:
In this scenario, the robot pronounces the definition of the word rainbow extracted from Wikipedia. This text, which can be considered neutral, has been manually annotated to force VADER to give a negative, neutral or positive sentiment, so that the whole sad-neutral-happy continuum can be compared. The main goal of this setup is to show how the adjustment of the exponential gain affects the personality of the robot for the three sentiments extracted by the analyzer. Two different videos allow us to compare the behavior of the robot while keeping the polarity of the text in a given state. The first video https://www.youtube.com/watch?v=8-t-URpHsiQ (accessed on 18 May 2021) corresponds to the discrete personality (K = 1.5), while the second https://www.youtube.com/watch?v=1ggNOlstFg4 (accessed on 18 May 2021), recorded with K = 4.0, shows a nervous or even histrionic behavior. Clearly, whatever the nature of the text, the histrionic version shows more extreme expression levels than the discrete one, which indeed fits a more natural behavior.

Scenario 2:
A different scenario is set in which the sentiment of the text varies over time, and so do the different talking features of the robot. Several passages from Anna Llenas' story entitled The Color Monster have been chosen to demonstrate the robot's performance. In this story, the protagonist is a sweet monster who wants to explain how he feels and uses colour to do so. Arguably, we cannot detach the expected expression from what is being said, and this tale has proven to be a perfect tool to evaluate correspondences between meaning and generated expression. This is definitely a more realistic scenario, appropriate for a storytelling robot. The recorded videos show how the robot adapts its expression level according to each sentence, showing a progressive modulation of the different features, again for the two identities: the discrete https://www.youtube.com/watch?v=eovp55f1jhs (accessed on 18 May 2021) and the histrionic https://www.youtube.com/watch?v=s01noQ1u1jM (accessed on 18 May 2021) one.

Conclusions and Further Work
This work has presented an adaptive system that implements an expression behavior for humanoid robots. Beat talking gestures are generated, along with other non-verbal cues such as head positioning, modulation of the voice tone, color of the eyes and speed of the arm movements. All of these features are related to the sentiment expressed in the sentences spoken by the robot.
On the basis of the obtained results, we consider that the presented approach gives good results and allows independent aspects of the robot's expression to be emphasized or faded. It thus facilitates the adaptation of the robot's personality to the audience or event. Note that each feature is independent and can therefore be adjusted as such, giving rise to multiple different expression behaviors.
VADER has proven to be a valuable tool, but rather insensitive to subtle sentences. It behaves as it has been trained to, i.e., to categorize individual short sentences, and thus it takes into account neither the relation among consecutive sentences nor the semantic meaning. Training a more specific classifier should minimize unexpected results and eliminate the text annotation we perform using emoticons to modify the values of the compound scores. Moreover, keeping the robot's emotional state as a function of other affective inputs could serve to dynamically adapt K and so modify the robot's mood at execution time.
Generated movements are executed by direct kinematics. This does not allow us to shorten the motion area of the arms; instead, the approach taken reduces the number of movements generated per second. Adjusting the attainable configuration space would eliminate strange beat gestures when the chosen identity requires it.
Instead of relying solely on the sentiment analysis of the text being verbalized, richer and more insightful signals must be considered to comply with the requirements of the affective loop. Adopting a dimensional model will also enrich the emotion spectrum and extend the sadness-happiness continuum.
Finally, it would be possible to use multimodal data to train a generative model and produce affectively modulated talking features. A comparison between the automatic behavior generated by such a model and the one proposed here would then be mandatory. In any case, an evaluation in a public performance is needed in order to verify that observers are affected by the emotional performance of the robot.