Continuous Emotion Recognition for Long-Term Behavior Modeling through Recurrent Neural Networks

One's internal state is mainly communicated through nonverbal cues, such as facial expressions, gestures and tone of voice, which together convey the corresponding emotional state. Hence, emotions can be effectively used, in the long term, to form an opinion of an individual's overall personality. The latter can be capitalized on in many human–robot interaction (HRI) scenarios, such as an assisted-living robotic platform, where a human's mood may entail the adaptation of a robot's actions. To that end, we introduce a novel approach that gradually maps and learns the personality of a human by perceiving and tracking the individual's emotional variations throughout their interaction. The proposed system extracts the facial landmarks of the subject, which are used to train a suitably designed deep recurrent neural network architecture. This architecture is responsible for estimating the two continuous coefficients of emotion, i.e., arousal and valence, following the widely known Russell's model. Finally, a user-friendly dashboard is created, presenting both the momentary and the long-term fluctuations of a subject's emotional state. We thus propose a handy tool for HRI scenarios where a robot's activity adaptation is needed for enhanced interaction performance and safety.


Introduction
Nonverbal cues, such as facial expressions, body language and voice tone, play a principal role in human communication, transmitting signals of the individual's implicit intentions that cannot be expressed through spoken language. These cues compose the emotional state, which conveys one's internal state throughout an interaction. Therefore, current research investigates the development of empathetic robots capable of perceiving emotions, as an attempt to enhance the overall performance of several human-robot interaction (HRI) scenarios [1,2]. The present study is anticipated to benefit the development of competent social robotic platforms [3,4] and enhance their applications in several recent real-world scenarios [5,6]. All the above render affective computing an emerging research field, which aims to address a wide set of challenges that play key roles in the development of human-like intelligent systems [7].
Based on the existing literature in the fields of psychology and neuroscience, one's emotional state can be described following two distinct representational approaches, namely the categorical and the dimensional approaches. The first, introduced by Ekman, suggests the following six universal basic emotions: happiness, sadness, fear, anger, surprise and disgust [8]. Following Ekman's model, several alternative works were developed for categorical emotion estimation, either by dropping the emotional classes of surprise and disgust or by introducing secondary ones, such as hope and the neutral state [9]. While the vast majority of emotion recognition systems have adopted Ekman's categorical approach, geometric features, such as facial landmarks, are highly desirable, as they provide a more robust representation of the human face when the subject moves, a fact commonly observed during a natural interaction. In a previous work of ours, we discussed the benefits of understanding the long-term behavior of a subject throughout an interaction and made a first attempt at providing such an estimation [34]. The introduced system performed the following: (a) extracted the facial landmarks of a subject; (b) used a DNN architecture to predict the values of arousal and valence; (c) built a long-term overview of the subject's behavior according to the emotional variations during the interaction; (d) displayed both the momentary and the long-term estimated values on the two-dimensional unit circle through a user-friendly dashboard.
The present paper extends the above work, providing the following qualities:
• Enhanced continuous emotion recognition performance, employing recurrent neural network (RNN) architectures instead of DNN ones;
• Competitive recognition results compared with state-of-the-art approaches in the field, following the stricter and more realistic leave-one-speakers-group-out (LOSGO) evaluation protocol [35];
• Implementation of an efficient and user-friendly human behavior modeling tool based on the experience gained through interaction.
The remainder of the paper is structured as follows. Section 2 describes the materials and methods of the system, namely the dataset adopted for the experimental studies and the evaluation of the system, as well as the modules that constitute the final system. Section 3 presents the ablation and experimental studies conducted to arrive at an efficient emotion recognition system, together with the validation procedure followed to assess its final performance. Section 4 provides an extensive discussion regarding the application of the proposed system and its importance in HRI and assisted-living environments, Section 5 concludes with a summary of the paper, and Section 6 outlines directions for future work.

Materials and Methods
This section describes the database used to train the deep learning models. In addition, we discuss the tools that constitute the overall emotion estimation system.

Database
During our experimentation, we employed the Remote Collaborative and Affective (RECOLA) database [36], which includes a total of 9.5 h of multimodal recordings from 46 French participants. Each recording lasted five minutes, with the subjects performing a collaborative task in dyads. The annotation of all recordings was performed by three female and three male French-speaking annotators through the Annotating Emotions (ANNEMO) tool. The database includes several modalities, i.e., audio, video, electro-dermal activity (EDA) and electro-cardiogram (ECG). The provided labels include the arousal and valence values in the continuous space, regarding the spontaneous emotions expressed during the interaction. In our work, we used the 27 subjects provided by the open-source database. We followed the standard evaluation protocol proposed in the Audio/Visual Emotion Challenge and Workshop (AVEC) 2016 [37], splitting the dataset into three parts of nine subjects each, i.e., training, evaluation and testing sets.

Face Detection Tool
To aid the facial landmark extractor, it is highly important to first crop the input RGB image, so as to remove the noisy background and pass only the facial region to the extractor. This step is necessary to avoid several errors in the feature extraction process, such as the one illustrated in Figure 1a, where the noisy background of the video frame leads to wrong keypoint extraction. Such errors can be efficiently avoided by detecting and cropping the face of the participant before passing it to the landmark extractor. For this purpose, the well-established feature-based cascade detector [38], employed in our previous work, detects and crops facial images from the captured video frames. The selection of this specific detector was based on both its simple architecture and its ability to sustain real-time operation. The only difference from our previous detection tool is a resize performed on each facial image after extraction, so as to keep the inter-eye distance constant at 55 pixels [30,39].
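The eye-distance normalization step above can be sketched as follows. The function name and the nearest-neighbour resampling are our own illustrative choices to keep the snippet dependency-free (a real pipeline would typically use the OpenCV cascade detector and `cv2.resize`); only the 55-pixel target follows the text.

```python
import numpy as np

def rescale_to_eye_distance(face_img: np.ndarray,
                            left_eye: tuple,
                            right_eye: tuple,
                            target_dist: float = 55.0) -> np.ndarray:
    """Resize a cropped face so the inter-eye distance becomes target_dist px.

    Uses nearest-neighbour index sampling to stay dependency-free; a real
    pipeline would use bilinear interpolation (e.g. cv2.resize).
    """
    (x1, y1), (x2, y2) = left_eye, right_eye
    current = float(np.hypot(x2 - x1, y2 - y1))   # current eye distance
    scale = target_dist / current                 # uniform scale factor
    h, w = face_img.shape[:2]
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # map each output pixel back to its nearest source pixel
    rows = np.clip((np.arange(new_h) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    return face_img[np.ix_(rows, cols)]
```
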

Facial Landmark Extraction Tool
The facial images produced by the tool described in Section 2.2 were fed into a facial landmark extraction tool. Following our previous approach, the tool's implementation was based on the dlib library, proposed by Kazemi and Sullivan [40]. The algorithm extracts landmarks from the mouth, nose and jaw, as well as the two eyebrows and eyes, as shown in Figure 2. The above procedure yields 68 facial points in total, each described, as usual, by its two spatial coordinates, x and y [41]. However, since the subjects had a microphone attached to the right side of their jaw, as shown in Figure 1b, the landmark detector commonly could not locate the points of that region. Thus, the entire jaw region was excluded from the extraction, taking into account that the emotional state is not particularly conveyed by this region. Consequently, the x and y values of the 49 resulting landmarks were kept in two vectors, $l^{(t)}_x, l^{(t)}_y \in \mathbb{R}^{49}$, forming the input at time step t to be fed into the emotion recognition tool. The difference of the current tool, compared with the extractor of our previous work, lies in the introduction of a landmark standardization scheme. To that end, all detected features were forced to present a zero mean value and a standard deviation equal to 1 in both the x and y dimensions. More specifically, we computed the mean ($\mu^{(t)}_x$, $\mu^{(t)}_y$) and standard deviation ($\sigma^{(t)}_x$, $\sigma^{(t)}_y$) values of each vector at a given time step t. Then, the standardized values were computed as follows:

$$\hat{l}^{(t)}_x = \frac{l^{(t)}_x - \mu^{(t)}_x}{\sigma^{(t)}_x}, \qquad \hat{l}^{(t)}_y = \frac{l^{(t)}_y - \mu^{(t)}_y}{\sigma^{(t)}_y}.$$

The above proved to considerably aid the performance of the DNN model. Finally, the vectors $\hat{l}^{(t)}_x$ and $\hat{l}^{(t)}_y$ are concatenated, forming the following vector:

$$l^{(t)} = \left[\hat{l}^{(t)}_x; \, \hat{l}^{(t)}_y\right],$$

where $l^{(t)} \in \mathbb{R}^{98}$.
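A minimal sketch of the per-frame standardization and concatenation, assuming the 49 surviving landmarks are already available as two coordinate vectors:

```python
import numpy as np

def standardize_landmarks(lx: np.ndarray, ly: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance standardization of the 49 landmark
    coordinates of one frame, then concatenation into l(t) in R^98."""
    lx_hat = (lx - lx.mean()) / lx.std()  # standardize x coordinates
    ly_hat = (ly - ly.mean()) / ly.std()  # standardize y coordinates
    return np.concatenate([lx_hat, ly_hat])
```
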

Continuous Emotion Recognition Tool
The final component of our system is the continuous emotion recognition tool (CERT), which is responsible for estimating the valence and arousal values of the subject. Given a time step t, the extracted vector $l^{(t)}$ is organized into a sequence along with the $l_s - 1$ previous vectors, where $l_s \in \mathbb{N}^*$ is a hyper-parameter of the system to be empirically configured. This procedure was followed for each time step t, producing a final set of sequences of length $l_s$. The above sequences, which constitute the input of CERT, were fed into an RNN architecture $R^N$, with $N \in \mathbb{N}^*$ the number of layers. Note that, due to their proven efficacy, we used LSTM cells [29] for our RNN architecture. Each layer can have a distinct number of hidden units $H_n \in \mathbb{N}^*$, with $n = 1, 2, \ldots, N$, a number that also defines the layer's output dimension. Considering the above, an architecture is denoted as $R^N\{H_1, H_2, \ldots, H_N\}$, so the output of $R^N$ is a vector of size $H_N$. Given that there are two values to be estimated, we set $H_N = 2$, implying an output vector $o \in \mathbb{R}^2$. This output is passed through a hyperbolic tangent activation function $F = \tanh$, producing the final prediction $p \in \mathbb{R}^2$:

$$p = \tanh(o).$$

The network's parameters $\theta_R$ are optimized by minimizing the mean squared error (MSE) cost function:

$$\mathcal{L}(\theta_R) = \frac{1}{2}\sum_{i=1}^{2} \left(p_i - \hat{p}_i\right)^2,$$

where $\hat{p}_i$ is the corresponding ground-truth value. All experiments were conducted using Python 3.9.7 and PyTorch 1.10.0 on an NVIDIA GeForce 1060 GPU with 6 GB of memory. Each training procedure lasted 150 epochs with a batch size of 256, using a stochastic gradient descent (SGD) optimizer [42]. We used an initial learning rate of $10^{-3}$ that decays by an order of magnitude after the 75th epoch.
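The sequence construction and the $R^3\{98, 128, 2\}$ architecture described above can be sketched in PyTorch as follows. The layer sizes, the tanh output and the final-step prediction follow the text, while the exact cell wiring and all names (`CERTNet`, `make_sequences`) are our assumptions:

```python
import torch
import torch.nn as nn

class CERTNet(nn.Module):
    """Sketch of R^3{98, 128, 2}: stacked LSTM layers whose last hidden
    size is 2 (arousal, valence), with tanh squashing the output into
    [-1, 1]."""
    def __init__(self, sizes=(98, 128, 2)):
        super().__init__()
        # input of the first layer is the 98-dim landmark vector l(t)
        in_dims = (98,) + tuple(sizes[:-1])
        self.layers = nn.ModuleList(
            nn.LSTM(d_in, d_out, batch_first=True)
            for d_in, d_out in zip(in_dims, sizes)
        )

    def forward(self, x):                 # x: (batch, l_s, 98)
        for lstm in self.layers:
            x, _ = lstm(x)
        return torch.tanh(x[:, -1])       # prediction p at the last step

def make_sequences(frames: torch.Tensor, l_s: int = 35) -> torch.Tensor:
    """Stack each frame vector with its l_s - 1 predecessors."""
    return torch.stack([frames[t - l_s + 1:t + 1]
                        for t in range(l_s - 1, len(frames))])
```

Training would then minimize `nn.MSELoss()` between `net(batch)` and the ground-truth arousal/valence pairs using SGD, as described above.
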

Validation Strategy
As already mentioned in Section 1, we employed the LOSGO scheme to validate our models [35]. This is a stricter validation scheme compared with the one followed in our previous approach [34]. Under this scheme, the initial dataset is split by subject, leaving one group of subjects exclusively for evaluation and another exclusively for testing. Following the standard AVEC 2016 protocol, the dataset was divided into three parts of nine subjects each.
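A minimal sketch of the subject-wise LOSGO partitioning, assuming subjects are identified by integer ids; the three groups of nine follow the AVEC 2016 split described above:

```python
import random

def losgo_split(subject_ids, n_groups=3, seed=0):
    """Leave-one-speakers-group-out: partition subjects into disjoint
    groups so that train/validation/test never share a speaker."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    size = len(ids) // n_groups
    return [ids[i * size:(i + 1) * size] for i in range(n_groups)]
```
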

Results
In this section, we summarize the empirical study and the experimental results of our work. Firstly, by adopting a similar DNN architecture and $l_s = 1$, we show that RNN utilization benefits the system's recognition performance on two different architectures. Simultaneously, we searched for the optimal sequence length $l_s$, evaluated on those two RNN architectures. Subsequently, we studied several versions of recurrent architectures to choose our best model. Finally, by exploiting the selected best model, we updated the framework of continuous emotion estimation and long-term behavior modeling presented in our previous work. The updated framework is demonstrated in a similar way through a user-friendly dashboard, which visualizes the estimated momentary and long-term values of arousal and valence on Russell's two-dimensional circle.

Ablation Study
We begin with the comparison between the simple DNN architecture used in our previous work and the recurrent one of the introduced approach. Note that, due to the different validation strategies, the obtained MSE values differ from those presented in our previous work. Hence, we replicated the experiments of the DNN models to comply with the validation scheme adopted in this paper. The architectures used for the experimentation are depicted in Table 1. For a fair comparison, each LSTM layer was replaced by a fully connected (FC) one. We keep the same notation for the investigated DNN models, adopting the symbol D. Hence, a DNN architecture is denoted as $D\{H_1, H_2, \ldots, H_N\}$, with $N \in \mathbb{N}^*$ the number of layers and $H_n \in \mathbb{N}^*$, $n = 1, 2, \ldots, N$, the number of neurons of the nth hidden layer.
Subsequently, Table 2 lists the MSE values obtained for each architecture of Table 1. For each experiment, we display the last MSE value, i.e., the value obtained after the final epoch, as well as the best one achieved during the training procedure on the validation set of the RECOLA database. The benefit of utilizing an RNN architecture instead of a simple DNN is clear: the overall performance improved considerably whenever a DNN was replaced by a corresponding recurrent architecture. All the above demonstrate the limited capacity of the system proposed in our previous work and the necessity of updating it with a more efficient one.

Sequence Length Configuration
Given the superiority of RNNs in the specific application, we proceed with an experimental study for defining an optimal value of the sequence length parameter $l_s$. We searched within the range [5, 50] with a step of 5. Similarly to the previous study, we kept both the best and the final MSE values of each training procedure. The obtained results are graphically illustrated in Figure 3, where the horizontal x-axis shows the investigated $l_s$ values and the vertical y-axis the obtained MSEs. The blue color represents the final MSE values of each experiment, whereas the orange represents the corresponding best ones. The above study was conducted on both $R^3\{98, 128, 2\}$ and $R^3\{98, 256, 2\}$. We can observe that, for both architectures, high values of $l_s$ lead to better results. Yet, we have to keep in mind that the higher the $l_s$, the more operating time the system requires. Hence, we searched for a value that combines both low processing time and low MSEs. For better comprehension, we display the best MSE values of both architectures in Table 3. Accordingly, we selected $l_s = 35$, since it presents close-to-optimal recognition performance in both cases while sustaining the operating time at low levels.

Architecture Configuration
After the selection of a suitable $l_s$ value, we investigated several architectural variations of RNNs, considering both different numbers of layers and different numbers of hidden units. We conducted several experiments using the experimental setup of Section 2.4 and the validation strategy of Section 2.5, and collected the top seven models, presented in Table 4. A quick overview of the table shows that the specific emotion recognition tool is more accurate when architectures with fewer hidden layers are used. Meanwhile, the final performance did not benefit considerably from increasing the number of hidden units. Overall, we selected $R^3\{98, 128, 2\}$ as our best model. In Figure 4, we indicatively show the training curves of two of our best models.

Comparative Results
To place our emotion estimation system within the state of the art, we compare the results obtained by our best architecture against the corresponding ones achieved by other works in the field. A quick overview of the related literature shows that such a comparison is typically realized through the concordance correlation coefficient (CCC) metric, which measures the agreement between two sequences x and y, as follows:

$$\rho_c = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},$$

where $\mu_x, \mu_y$ are the mean values, $\sigma_x, \sigma_y$ the standard deviations and $\sigma_{xy} = \mathrm{cov}(x, y)$ the covariance of x and y. Thus, for a fair comparison, we calculate the CCC values between the predictions of $R^3\{98, 128, 2\}$ and the corresponding ground-truth values for both the arousal and the valence dimensions. Note that the above estimations were performed on the validation set, since this is the set most commonly used by the existing methods. The obtained results are collected in Table 5, along with the ones achieved by state-of-the-art works in the field. For better comprehension, we also display the features of the respective methods, i.e., geometric, appearance and raw RGB image. The obtained results reveal the competitive recognition performance of the introduced architecture in terms of both arousal and valence. A closer look shows that the methods that exploit geometric features reach a better estimation of the valence dimension than of the arousal one; similarly, the proposed system appears to better capture the valence values of the emotional state.
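The CCC metric above can be computed directly from its definition; a minimal NumPy sketch:

```python
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Concordance correlation coefficient:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mu_x - mu_y)^2)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
```

Unlike Pearson's correlation, the CCC also penalizes differences in scale and offset between the two sequences, which is why it is preferred for continuous arousal/valence evaluation.
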

Continuous Speaker Estimation
Having defined the final architecture of CERT, we evaluate its performance in the testing speakers group of RECOLA. Given a speaker of this set, we exploit our best model to estimate the arousal and valence values during the interaction. The extracted values are organized in two separate one-dimensional signals through time and compared against the corresponding ground truth ones. For illustration purposes, in Figure 5, we display an indicative example of the comparison, where we can recognize the ability of the system to follow the ground-truth values. Considering the above, we can conclude that the adopted RNN architecture improves CERT's capability of perceiving the long-term variations of emotional state.
The MSE values calculated between the signals are presented in Table 6 for each subject of the set. The performance of the system remains at a competitive level for every speaker. Meanwhile, the reader can observe the large difference in MSE values between the valence and the arousal coordinates. This is due to the fact that arousal is a more difficult dimension to capture from visual data through geometric features, as already stated in Section 3.4; hence, audio input is often used to enhance the efficiency of an emotion recognition system [30]. In contrast, valence can be accurately captured through the extracted facial landmarks.

Discussion
In this section, we hold a conversation about the proposed system as a whole. More specifically, we demonstrate an updated version of the dashboard introduced in our previous work, focusing on a user-friendly and low-complexity solution. Subsequently, we discuss the beneficiary role of the system in application fields, such as HRI and collaboration tasks, as well as in more specific tasks, such as robots in assisted living environments.
To begin with, Figure 6 depicts an indicative graphical snapshot of the proposed tool. The left part of the dashboard shows the current frame of the processed video, with the facial keypoints estimated by the facial landmark extraction tool projected onto the image plane. This part is crucial for two reasons. On the one hand, it provides the user with a general overview of the processing procedure in the case of video processing, while in real-time execution, the speaker can continuously supervise their position relative to the camera and correct their position and/or orientation if needed. On the other hand, the projection of the extracted facial landmarks onto the illustrated frame provides feedback on the system's ability to track the interaction efficiently. Thus, the user is informed that the environmental conditions, such as illumination and background, as well as their position and point of view, allow the system to monitor them accurately. At this point, recall that the efficient extraction of the facial keypoints is of the utmost importance for the final performance of the system. Consequently, this part provides the user with a higher level of certainty, knowing that they can observe the general procedure and proceed to corrective actions.
The central part of the dashboard shows the CERT's momentary estimation of the speaker's emotional state for the specific frame depicted in the left part of the dashboard. The predicted arousal and valence values of the CERT are projected onto the two-dimensional Russell's unit circle, with valence represented by the horizontal axis (x-axis) and arousal by the vertical one (y-axis). Hence, the momentary emotional state occupies a particular point within Russell's circle. At the next time step, the new values of arousal and valence are calculated and projected as a new point, so the user observes a point that continuously moves within the unit circle. Turning to the main contribution of this work, namely the modeling of a speaker's long-term behavior during an interaction, the right part of the dashboard provides the pictorial result. A similar two-dimensional unit circle, from now on called the history circle, provides the projection space of the estimated behavioral pattern, with valence and arousal again represented by the horizontal and vertical axes, respectively. At each time step, the point provided by the momentary estimation is incorporated into the history circle by adding the current estimated value to the previous ones stored there. The stored values are first multiplied by a discount factor $d_f = 0.9$, thus fading the older estimations and paying more attention to the recent ones. The obtained illustrative result, shown in Figure 6, is a heatmap within the history circle: the lighter a region of the map, the more frequently the corresponding emotional state is expressed by the speaker throughout the interaction.
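The discounted accumulation into the history circle can be sketched as follows; the grid resolution and the linear binning of the [-1, 1] axes are our own illustrative choices, while the discount factor d_f = 0.9 follows the text:

```python
import numpy as np

def update_history(history: np.ndarray, valence: float, arousal: float,
                   d_f: float = 0.9) -> np.ndarray:
    """Fade previous evidence by the discount factor d_f, then add the new
    momentary (valence, arousal) point, binned onto a grid over [-1, 1]^2."""
    n = history.shape[0]
    history *= d_f                                 # fade older estimations
    col = min(int((valence + 1) / 2 * n), n - 1)   # x-axis: valence
    row = min(int((arousal + 1) / 2 * n), n - 1)   # y-axis: arousal
    history[row, col] += 1.0                       # add the momentary point
    return history
```

Rendering `history` as a heatmap then reproduces the behavior described above: frequently visited emotional states stay bright, while rarely visited ones fade geometrically over time.
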
Bringing the presentation of the dashboard to a close, we discuss its benefits along with several fields of application in HRI and assisted-living scenarios. The term "robotics in assisted living", also known as aging in place [47], refers to the field of research that focuses on the design, development and testing of efficient robotic platforms enabling elderly people to live and be served in their own houses [48]. This entails a wide variety of services from the side of the technology providers, focusing on safety, health monitoring, supervision and assistance in everyday activities, i.e., cleaning, object movement, cooking, etc. [49,50]. Technological solutions examined in this field include smart houses, wearable devices, ambient sensors and robotic platforms [51,52]. The main advantage of the robotic solution lies in the mobility it provides, enabling the continuous supervision of the elderly, as well as in its capability of proceeding to several actions when required [53]. However, the relatively low level of comfort that older people feel when they coexist with a robotic agent that, in some way, inspects their movements remains an open question. Therefore, the development of efficient tools that improve the capacity of the robotic agent to comprehend the state of the subject is highly desired, so as to cultivate a sense of familiarity [54].
Considering the above, the reader can understand our concern for the transparent operation of the proposed tool, in the sense of communicating the basic steps of its processing procedure to the interacting person. As far as the main task of the introduced system is concerned, namely the long-term behavior estimation, we envisage it as a form of user personality profiling. Given deviating personality patterns, the same momentary emotion expressed by two different subjects can imply totally different meanings regarding their internal state; an indicative example is the different meaning of anger for an introverted and an extroverted person. The human ability to create behavioral models of other people lets us weight the impact of the emotions they express. The proposed system provides the exact same capability: the creation of the subject's behavioral history enables the comparison of the contextually perceived momentary emotional state against the subject's behavioral pattern, leading to individualized conclusions regarding their internal state. The above can be used for comparing momentary estimations either against the user's complete behavioral profile or against shorter behavioral patterns, such as daily mood, according to the nature of the interaction.

Conclusions
To sum up, the paper at hand proposes an advanced solution for estimating the momentary and long-term emotional state of a speaker during an interaction, utilizing RNN architectures. Using face detection and landmark extraction techniques, the most informative emotional features are extracted and fed into a suitably designed recurrent architecture. Our empirical study shows that the recurrent architecture considerably aids the estimation performance and, consequently, the system's efficiency in creating an accurate behavioral model of the speaker. The above are summarized into an updated version of the graphical tool that communicates the basic steps and results of the process. We then discuss the importance of developing transparent and explainable tools that can understand and map the internal state of an interacting person, in order to build a relationship of familiarity and trust. The above is highly anticipated to improve the performance rates of existing robotic platforms in the field of HRI, as well as humans' openness to confidently collaborate with robots. At this point, we particularly focus on elderly people because, on the one hand, they seem to be among the most skeptical age groups, while on the other hand, the rising need for people to age in their familiar places reinforces the necessity of human-robot coexistence.

Future Work
As part of future work, we aim to incorporate the proposed system into a more realistic and complicated HRI scenario, such as a human-machine collaboration task, and evaluate its capacity to improve the performance of the scenario. Taking into consideration the discussion in Section 4, the above system is anticipated to be applied in a use case related to fall detection for elderly people, since it provides the opportunity to model fatigue, among other internal states, in a personalized manner. Moreover, cutting-edge techniques focusing on DNNs' representation learning capacity can be tested to further enhance the system's recognition performance [55,56]. These examine novel hidden layers and loss functions that improve the feature learning capabilities of existing CNNs, providing more robust feature extractors [57,58].
Finally, as already stated in Section 1, the audio modality is not processed by our CERT, mainly because we aim for a system capable of estimating human behavior throughout the whole interaction scenario, i.e., including its nonverbal parts. As part of future work, a more sophisticated system can be investigated, capable of shifting from a visual to an audio-visual processing tool based on the speech of the person. The above is anticipated to improve recognition performance mainly in the arousal dimension, as proved by novel audio-visual approaches [10,30].

Funding: We acknowledge support of this work by the project "Study, Design, Development and Implementation of a Holistic System for Upgrading the Quality of Life and Activity of the Elderly" (MIS 5047294), which is implemented under the Action "Support for Regional Excellence", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).
Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the fact that this research includes experiments with prerecorded datasets.

Informed Consent Statement:
This research includes experiments with prerecorded datasets and we did not conduct any live experiment involving humans.

Data Availability Statement:
In this research, we used a prerecorded database to train and evaluate the proposed visual emotion recognition system. The used RECOLA database is available online (https://diuf.unifr.ch/main/diva/recola, accessed on 16 January 2019) [36].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

MDPI	Multidisciplinary Digital Publishing Institute
DNN	deep neural network
RNN	recurrent neural network
LSTM	long short-term memory
CERT	continuous emotion recognition tool