Bringing Depth Data Alive : Conceive Human Intention through Web Visualisations of Head Pose and Emotion Changes †

Affective computing in general, and human activity and intention analysis in particular, is a rapidly growing field of research. Head pose and emotion changes present serious challenges when applied to the study of a player's training and ludology 1 experience in serious games, the analysis of customer satisfaction regarding broadcast and web services, or the monitoring of a driver's attention. Given the increasing prominence and utility of depth sensors, it is now feasible to perform large-scale collection of three-dimensional (3D) data for subsequent analysis. Discriminative random regression forests were selected in order to rapidly and accurately estimate head pose changes in unconstrained environments. For the secondary process of recognising four universal dominant facial expressions (happiness, anger, sadness and surprise), emotion recognition via facial expressions (ERFE) was adopted. A lightweight data exchange format (JavaScript Object Notation, JSON) is then employed in order to manipulate the data extracted from the two aforementioned settings. Motivated by the need to generate comprehensible visual representations from different sets of data, in this paper we introduce a system capable of monitoring human activity through head pose and emotion changes, utilising an affordable 3D sensing technology (the Microsoft Kinect sensor).


Introduction
Human intention analysis is a rapidly growing field of research, due to the constantly growing interest in applying automatic human activity analysis to all kinds of multimedia recordings involving people. Applications include the study of a player's training and ludology experience in serious games, the analysis of customer satisfaction regarding broadcast and web services, and the monitoring of a driver's attention. Given the increasing prominence and utility of depth sensors, it is now feasible to perform large-scale collection of three-dimensional (3D) data for subsequent analysis [1,2]. In this work, we focus in particular on recognising head pose and facial expression changes, which provide a rich source of information for analysing human activity in several areas of human-computer interaction (HCI).
Head pose estimation (HPE) refers to the process of deducing the orientation of a person's head relative to the view of a camera or, more precisely, relative to a global coordinate system. Head pose estimation is considered a key element of human behaviour analysis. Accordingly, it has received extensive coverage in the scientific literature, and a variety of techniques have been reported for precisely computing head pose [3], while depth information can also be integrated [4][5][6].
1 A borrowing from the Latin word 'ludus' (game), combined with an English element; the term has historically been used to describe the study of games.
Facial expression is one of the most dominant, natural and instantaneous means for human beings to communicate their emotions and intentions [7]. The reason for this lies in the ability of the human face to express emotion sooner than people verbalise or even realise their feelings. Humans are able to observe and recognise faces and facial expressions in a scene with little or no effort [8]. However, developing an automated system that performs facial expression recognition (FER) is still regarded as a rather difficult task.
In computer and information science, visualisation refers to the visual representation of a domain space using graphics, images and animated sequences to present the data structure and dynamic behaviour of large, complex datasets representing systems, events, processes, objects and concepts. Data visualisation is a relatively new field, which not only pairs graphic design with quantitative information, but also studies humans' cognitive understanding and interpretation of graphical figures, aiming to convey data in the most efficient, yet accurate and representative, way [9]. In addition, visualisations can be used in several distinct ways to help tame the scale and complexity of the data so that it can be interpreted more easily.
One field that has yet to benefit from data visualisations is human intention understanding. Following Jang et al. [10], human intention can be explicit or implicit in nature. Typically, humans express their intention explicitly through facial expressions, head movements, speech, and hand gestures. Interpreting the user's explicit intention, which contains valuable information, is vital in developing efficient human-computer interfaces. In conventional human-computer interface (HCI) and human-robot interaction (HRI) environments, user intentions such as 'copy this file' or 'create a folder' can be explicitly conveyed through a keyboard and a computer mouse [11,12], and can be easily interpreted. The process of data visualisation is suitable for externalising the facts and enabling people to understand and manipulate the results at a higher level.
Most facial expression recognition and analysis systems proposed in the literature focus on the analysis of expressions, without any concern for subsequent interpretation of human intentions with respect to the task of interest. Similarly, even though head pose changes provide a rich source of information that can be used in several fields of computer vision, there are no references in the literature regarding subsequent analysis of those findings for the task at hand. In the above context, the aim of the present work is to develop a framework capable of interpreting user intentions from depth data of head pose and facial expression changes by visualising them on the web. Data visualisations can play an important role in conceiving a user's explicit intention in many applications. One such application is the assessment of the player's training and ludology experience in the case of serious games such as [13,14]. The main hypothesis, in the context of serious games, is that an educator can intervene in the game characteristics in order to increase the learner's performance. The underlying assumption is that the educator can easily interpret the intention of all users after the experiments have concluded, and act accordingly. Accessible visualisations can play a major part in that kind of assessment by creating encodings of data into visual channels that educators can view and understand comfortably, and they can lead to valuable conclusions regarding the overall experience of users and serious games players.
The remainder of the paper is organized as follows. The next section contains a summary of related work. Section 3 gives an overview of the adopted methods for capturing head pose and emotion changes, alongside a detailed description of our modifications for the experiments. The seven proposed web-based visualisations are presented in Section 4, alongside their implementation details. Finally, Section 5 concludes and describes future research directions.

Related Work
A variety of methods and graphs have been used to represent detected emotions, both in research projects and in industry. However, since the results vary based on the included features and the values they are assigned, a global representation method would be impracticable. Instead, the analysis produced can be delineated with a large diversity of depiction techniques.
One of the most prominent visualisation methods, assuming a large quantity of experiments performed, is the use of a two-dimensional line chart for a scheme of individual emotions [15]. This chart can show correlations between emotions and patterns in the data that may be a product of common motifs in the user's inputs (these can be translated as particular tasks/actions that cause an equivalent human reaction). In addition, the overall emotional state of the user can be monitored with a line chart, as a surge (or equivalently a decline) in the data may show a differentiation from expected values [16]. In this way, not only can the dominant mood of the participant be perceived, but also the way that the user is emotionally affected by activities or events. Furthermore, the line chart has also been used for interpreting real-time data [17], since it can aid in the demonstration of the constant data flow of emotion and head position.
Another informative visualisation that complements the aforementioned line chart, and can be used in conjunction with it, is the bar chart. This means of illustration supports both a complete view of individual instances and a general picture, based on the predicted results [18]. A bar chart allows the demonstration of the most popular emotion in addition to a complementary variable (such as the time that the data was captured or the duration of the experiment). Also, considering this additional variable, the most probable emotion can be found by taking into account the class of the categorised reactions of the users, as well as the classification confidence. Moreover, variations in values signify possible inconsistencies of the system, alongside examples or learning tasks which may have been more difficult to classify. For example, considering that anger and happiness are completely opposite emotions, when using two distinct bars for visualising them, the expected results would only be of binary distribution, with only one bar taking above-zero values (the same can also be said for head poses, for example up and down).
Regarding the representation of head pose estimations, most illustrations are targeted towards displaying the level of accuracy of each prediction and the position where the example was found to occur, in a 3D or 2D chart. A good demonstration of this mindset is the use of the perspective-n-point method [19], which 'simulates' the view of the user's head with a hexadecagon shape that can be rotated based on the determined direction of the head. However, this graphic is limited to the top view of the user's head and therefore only provides useful information when the head pose changes horizontally. More recent efforts have focused on 3D depiction of the user's head position based on pitch, roll and yaw as axes [20]. These three degrees of freedom make it possible to understand the position of the user's head during experiments. The combination of these three values constitutes each probable input head pose.

Overview of Methods for Capturing Head Pose and Emotion Changes
This section discusses the two methods employed in our experiments for recognising head pose and facial expression changes, utilising an affordable 3D sensing technology (the Microsoft Kinect sensor). The real-time head pose estimation and facial expression events are separately obtained for different users sitting and moving their head without restriction in front of a Microsoft Kinect sensor for specified intervals. Experimental results on 20 different users show that our modified frameworks can achieve a mean accuracy of 83.95% for head pose changes and 76.58% for emotion changes when validated against manually constructed ground truth data.

Estimation of Head Pose Changes
Systems relying on 3D data have demonstrated very good accuracy for the task of head pose estimation, compared to 2D systems that have to overcome ambiguity in real-time applications [21]. 3D head pose information drastically helps to determine the interaction between people and to extract the visual focus of attention [3]. The human head is limited to three degrees of freedom (DOF) in pose, expressed by three angles (pitch, yaw, roll) that describe the orientation with respect to a head-centred frame. Automatic and effective estimation of head pose parameters is challenging for many reasons. Algorithms for head pose estimation must be invariant to changing illumination conditions, to the background scene, to partial occlusions, and to inter-person and intra-person variabilities. For performing the set of experiments, we partly followed the approach of Fanelli et al. [22], which is suitable for real-time 3D head pose estimation, considering its robustness to the poor signal-to-noise ratio of current consumer depth cameras like the Microsoft Kinect sensor. While several works in the literature contemplate the case where the head is the only object present in the field of view [23], the adopted method concerns depth images where other parts of the body might be visible at the same time, and which therefore need to be separated into image patches either belonging to the head or not. The system is able to perform on a frame-by-frame basis and runs in real time without the need for initialisation. An extracted patch from a depth image is sent through all trees in the forest. The patch is evaluated at each node according to the stored binary test and passed either to the right or left child until a leaf node is reached [5], at which point it is classified. Only if this classification outcome is positive (a head leaf) is the Gaussian distribution stored at the leaf retrieved and used for casting a vote in a multidimensional continuous space.
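The per-patch tree traversal just described could be sketched as follows. The node layout (a binary test callback, left/right child links, and a leaf payload carrying the head flag and the Gaussian vote) is an assumed structure for illustration, not the implementation of Fanelli et al.

```javascript
// Hypothetical sketch of sending a depth-image patch down a regression
// forest. Node fields (test, left, right, isLeaf, isHead, vote) are
// illustrative assumptions, not the structure used by Fanelli et al.
function descend(tree, patch) {
  let node = tree;
  while (!node.isLeaf) {
    // Each internal node stores a binary test on the patch, e.g. a
    // difference of mean depths over two sub-rectangles of the patch.
    node = node.test(patch) ? node.right : node.left;
  }
  return node; // leaf: { isLeaf: true, isHead: bool, vote: {...} }
}

// Collect votes from all trees; only positive (head) leaves cast a
// vote in the continuous pose space.
function forestVotes(forest, patch) {
  return forest
    .map((tree) => descend(tree, patch))
    .filter((leaf) => leaf.isHead)
    .map((leaf) => leaf.vote);
}
```

In the actual system the collected Gaussian votes are then aggregated to produce the final head pose estimate for the frame.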
Figure 1 shows some processed frames regarding two DOF (pitch and yaw). All calculations derived from the difference between the exact previous frame and the current frame, at each iteration of the experiment. The green cylinder encodes both the estimated head center and direction of the face.
Our aim is to capture all the changes concerning pitch and yaw angles which occur during the experiments. For this reason, given the pitch (pitch_t) and yaw (yaw_t) intensities of the ongoing streaming frame, and the exact previous frame's pitch (pitch_{t-1}) and yaw (yaw_{t-1}) intensities, the system operates in three steps as follows: (a) the differences regarding pitch and yaw are calculated by Equations (1) and (2); (b) then a threshold value (THRESH) was experimentally set in order for our system to ignore negligible head movements in all four directions tested; (c) finally, the changes with respect to the four different directions are given by Equations (3) to (6):

pitchDiff = pitch_t - pitch_{t-1}  (1)
yawDiff = yaw_t - yaw_{t-1}  (2)
up = pitchDiff > THRESH  (3)
down = pitchDiff < -THRESH  (4)
left = yawDiff > THRESH  (5)
right = yawDiff < -THRESH  (6)
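The three steps above could be sketched as a per-frame function; the concrete value of THRESH below is an illustrative assumption, since the experimentally chosen threshold is not stated here.

```javascript
// Sketch of the per-frame head pose change detection. THRESH is the
// experimentally chosen threshold below which a head movement is
// treated as negligible; its value here is an assumption.
const THRESH = 5; // degrees, illustrative only

function headPoseChange(prev, curr) {
  const pitchDiff = curr.pitch - prev.pitch; // Eq. (1)
  const yawDiff = curr.yaw - prev.yaw;       // Eq. (2)
  return {
    up: pitchDiff > THRESH,     // Eq. (3)
    down: pitchDiff < -THRESH,  // Eq. (4)
    left: yawDiff > THRESH,     // Eq. (5)
    right: yawDiff < -THRESH,   // Eq. (6)
  };
}
```

Applied between every pair of consecutive frames, this yields the stream of direction changes that the later visualisations consume.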

Emotion Recognition from Facial Expressions
Emotion recognition via facial expressions (ERFE) is a growing research field in computer vision compared to other emotion channels, such as body actions and speech, primarily because it provides superior expressive force and a larger application space. Features utilised to classify human affective states are commonly based on the local spatial position or displacement of specific points and regions of the face. Recognition of facial action units (AU) is one of the two main streams in facial expression analysis. AUs are anatomically related to the contraction of specific facial muscles, 12 for the upper face and 18 for the lower face [24]. A total of 44 AUs can be derived from the face, and their combinations can compose different facial expressions. In this work, 4 basic universal expressions are considered: happiness, surprise, sadness and anger. An approach similar to Mao et al. [25] was followed for real-time emotion recognition. Video sequences acquired from the Kinect sensor are regarded as input. The Face Tracking SDK [26], which is included in Kinect's Windows Developer Toolkit, is used for tracking human faces with RGB and depth data captured from the sensor. Face detection and feature extraction are performed on each frame of the stream. Furthermore, facial animation units and 3D positions of semantic facial feature points can be computed by the face tracking engine, which can lead to the aforementioned emotion recognition via facial expressions. Face tracking results are expressed in terms of weights of six animation units, which belong to a subset of what is defined in the Candide-3 model [27]. Each AU, that is, a delta from the neutral shape, is expressed as a numeric weight varying between −1 and +1, with the neutral state normally assigned to 0.
Utilising Equation (7), the AU feature of each frame can be written in the form of a 6-element vector:

ā = [A1, A2, A3, A4, A5, A6]  (7)

where A1, A2, A3, A4, A5, and A6 refer to the weights of lip raiser, jaw lowerer, lip stretcher, brow lowerer, lip corner depressor, and brow raiser, respectively. Boundaries for each AU had to be empirically established in order to associate the AU feature vector with an emotion; happiness, for instance, corresponds to showing the teeth slightly, the lip corners raised and partly stretched, and the brows in the neutral position. Equations (8) to (11) were experimentally formulated for the test sessions. An example of all four different recognised emotions is shown in Figure 2.
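As a minimal sketch of such a mapping, the AU vector of Equation (7) could be thresholded as follows. The threshold values are hypothetical; the empirically established boundaries of Equations (8) to (11) are not reproduced here.

```javascript
// Illustrative mapping from a 6-element AU weight vector (Eq. (7)) to
// one of the four emotions. The thresholds are hypothetical, not the
// paper's empirically formulated boundaries (Eqs. (8)-(11)).
function classifyEmotion([lipRaiser, jawLower, lipStretcher,
                          browLower, lipCornerDepressor, browRaiser]) {
  // Happiness: lip corners stretched, brows not lowered.
  if (lipStretcher > 0.4 && browLower <= 0) return 'happiness';
  // Surprise: jaw dropped and brows raised.
  if (jawLower > 0.3 && browRaiser > 0.3) return 'surprise';
  // Sadness: lip corners depressed.
  if (lipCornerDepressor > 0.3) return 'sadness';
  // Anger: brows lowered.
  if (browLower > 0.3) return 'anger';
  return 'neutral';
}
```

Each frame's AU vector would be passed through such a rule set, producing the per-frame emotion labels that are later stored and visualised.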

Data Compilation and Experimental Setup
Regarding the storage of the obtained data, the JavaScript Object Notation (JSON) format was used, mainly because of its lightweight nature, its convenience in writing and reading and, more importantly, as opposed to other formats such as XML, its suitability for generating and parsing tasks in various Ajax applications, as described in [28]. A record in an array was created for each user session, with a nested array carrying three variables: time, direction and intensity for each movement that was detected, as shown in Figure 3. For facial expressions, a similar array was created, but in this case only two variables were listed: time and emotion, as shown in Figure 4.
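Session records of this shape could be produced and round-tripped as in the following sketch; the field names are assumptions based on the description above, not the exact keys of the released data.

```javascript
// Hypothetical JSON session records following the structure described
// above; field names are assumed, not taken from the released data.
const headPoseSession = {
  user: 'user01',
  movements: [
    { time: 2.4, direction: 'left', intensity: 7.2 },
    { time: 5.1, direction: 'up', intensity: 12.8 },
  ],
};

const emotionSession = {
  user: 'user01',
  emotions: [
    { time: 3.0, emotion: 'happiness' },
    { time: 9.5, emotion: 'surprise' },
  ],
};

// JSON's lightweight nature makes serialisation and parsing trivial,
// which is what the web visualisations rely on.
const text = JSON.stringify(headPoseSession);
const parsed = JSON.parse(text);
```

The web-based visualisations described in the next section consume exactly this kind of record via `JSON.parse`.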
In order to assess the validity of our modified versions of head pose estimation and emotion recognition, we performed the following experiments. First, the ground truth data had to be constructed; subjects were therefore asked to perform specific head movements while sitting in front of the Kinect sensor. For the emotion recognition framework, an analogous approach was followed, by asking subjects to make specific facial expressions while looking towards the direction of the Kinect sensor. Finally, the obtained results were compared against the pre-assembled ground truth data. The experiments are controlled by a number of parameters. Some parameters were fixed intuitively during the establishment stage of the experiments; for example, a threshold was set in order to separate actual changes of the pose from negligible ones that can occur when a user moves his head in an uncontrolled environment. Both methods run at 30 fps on a computer with an Intel Core Duo CPU @ 3.00 GHz.

Visualisations on the web
Although many different approaches have been proposed in the literature to solve the problem of recognising head pose and emotion changes, very few focus on how those data can be presented in order to deliver a useful interpretation effortlessly. To that end, the principal objective of this section is to introduce various efficient and user-friendly web-based visualisations 2 in order to improve the understanding and the analysis of human intentions from the captured data of head pose and emotion changes.

Head Pose
Four different visualisations are established for the desirable web-based data interpretation of head pose changes in Figures 5, 8, 10 and 11. The first one is a 2D scatterplot displaying the head movement of the user over a specified time period. After that, a column visualisation depicting the overall head pose changes grouped by their dominant direction is presented. Finally, an intensity chart and a pie chart, outlining the intensities of head pose changes and their proportions in terms of the dominant direction, are shown.

Head Pose Changes Across Time
Regarding the two-dimensional scatterplot, the x-axis represents the time scale in seconds during which the tests take place (Figure 5 shows only a zoomed portion of the whole graph), while each label on the y-axis symbolises a different user performing the test. Four different arrows imitate the movement of the human's head in two DOF. Furthermore, an additional feature is displayed when the mouse hovers over an arrow, showing the time each movement occurred and its intensity, which derives from the difference between the previous and the current frame, as explained in Section 3.1. Apart from those elements, a colour fluctuation is also evident, which serves as an intensity indicator for each movement (the closer to red the arrow is, the higher the intensity of the movement). One can easily examine the motion of the player that way, alongside its intensity.
2 The code to reproduce all the visualisations is available at https://github.com/GKalliatakis/BringDepthDataAlive.
The second visualisation consists of a column diagram which illustrates the aggregation of all head movements grouped by direction every two seconds, as shown in Figure 6. The four different directions are represented by four different colours. On the one hand, the x-axis represents the time scale, which is divided every two seconds until the end of the test. On the other hand, the y-axis displays the number of movements for all the users that take part in the tests. Furthermore, when hovering above a column, the corresponding direction summary is displayed. In this fashion, the dominant direction amongst all users at each time interval is effortlessly discerned. Moreover, unevenly distributed movements (e.g. the columns between 2-4 seconds in Figure 6) can lead to practical conclusions, taking the nature of the test into account as well. The full version of the overall head movement visualisation is available at: http://83.212.117.19/HeadPose3D/.
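The two-second aggregation behind the column diagram could be sketched as follows, assuming movement records with the time and direction fields described earlier:

```javascript
// Group detected head movements into 2-second bins per direction, as
// the column visualisation does. The record layout is an assumption.
function binByDirection(movements, binSeconds = 2) {
  const bins = {};
  for (const { time, direction } of movements) {
    const start = Math.floor(time / binSeconds) * binSeconds;
    const key = `${start}-${start + binSeconds}s`;
    bins[key] = bins[key] || { up: 0, down: 0, left: 0, right: 0 };
    bins[key][direction] += 1;
  }
  return bins;
}
```

Each resulting bin then maps directly to one group of four coloured columns in the chart, and the dominant direction per interval is simply the largest count in the bin.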

Intensities of Head Pose Changes
As mentioned before, the duration of the experiment can be an important aspect of the observations, as examples can be over-represented or under-represented at particular times during the tests. For this reason, an additional visualisation technique is used to exhibit any differences and inconsistencies. Considering these requirements, the best approach was the use of an intensity chart constructed as a decagon, with each edge representing a different time value of the experiments, as shown in Figure 7.
This grid can display the movements that were classified by the system based on a time interval, together with their average intensity at that time. Each edge of the decagon represents the average integer value (here also the mean) of each class. To support further user interaction with the page, once the cursor hovers over one of the ten rectangles, a small legend appears holding additional information for the movement that was found. The legend holds the four labels present, as well as the precise intensity of each event in that period (up to the two most significant decimal places). These rectangles can be defined as the shape produced by the two halves of two consecutive sides and by the two radii from the centre of the shape to the end-points of the side-halves. In case of missing examples between two observations that are separated by a class with a reasonable amount of information, the class is displayed as a line from the centre of the shape until the related intensity is reached. This is done in order to indicate that there is no relation between the previous and the next class, since no examples of those two time-slots are present in the data. The shape as well as the legend are dynamically generated by the system and can therefore be adapted if different data, or different classes, are to be used. The full version of the intensity grid visualisation is available at: http://83.212.117.19/IntensityGrid/.
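The decagon's vertices can be generated from the per-class average intensities, one vertex per time class at a distance from the centre proportional to that class's average intensity. The centre, scale, and starting angle below are illustrative assumptions.

```javascript
// Compute the vertices of the intensity decagon: one vertex per time
// class, placed at a distance from the centre proportional to the
// class's average intensity. Centre/scale/orientation are assumptions.
function decagonVertices(avgIntensities, cx = 0, cy = 0, scale = 1) {
  const n = avgIntensities.length; // 10 time classes for a decagon
  return avgIntensities.map((r, i) => {
    const angle = (2 * Math.PI * i) / n - Math.PI / 2; // start at top
    return [cx + scale * r * Math.cos(angle),
            cy + scale * r * Math.sin(angle)];
  });
}
```

A class with no examples would simply contribute a zero-length (or separately drawn) radius, matching the centre-line treatment described above.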

Head Pose Changes Grouped by Proportion of the Direction
Another widely used visualisation is the pie chart. In order to accommodate both attributes of our experiments (time in seconds and the estimated head pose direction), the pie consists of two layers, as shown in Figure 8. This structure allows a broader view of the experiments and the data, since a batch of variables is used instead of a single one. The internal circle consists of the time classes as determined in the previous illustrations. The information presented to the user at this level is primarily focused on the distribution of the data based on the experiment durations. Durations that are centred around smaller values are expected to hold larger confidence than those with larger time values. Therefore, when analysing the shape produced, the user would prefer to see a distribution concentrated around the smaller duration classes.

Emotions
The web-based visualisations regarding the emotions recognised via facial expressions are assembled in accordance with those for head pose changes. For the case of interpreting emotions in the context of various applications, three different visualisations are introduced in Figures 9 to 11. First, a punch-card table is presented in order to represent emotion changes across the time intervals of our experiments. After that, a column visualisation depicting the overall facial expression changes grouped by the resulting emotion is presented. Finally, emotions grouped by specific time intervals are illustrated in the form of a class connection circle.

Emotion Changes Across Time
A straightforward way of representing the emotions detected by the main system, in relation to the date and duration of the experiment, is the use of a two-dimensional punch-card. The y-axis of the card is used for the date of each experiment, while the x-axis specifies the duration of the experiment as time intervals, as shown in Figure 9. This allows the users to find the total number of tests that were carried out at a specified date, but also the order in which these experiments were conducted. The outcomes of the experiments ranged across four main classes (happiness, sadness, surprise and anger), with an additional 'combination' class which was used to represent the recognition of emotions in a pair (sadness and anger). Since this approach utilises the capabilities of data representation in 2D space, a viewer can furthermore find the times at which most observations occurred in the data provided.
The reason behind the choice of the punch-card table is the fact that the user's emotions can be tracked through time and motifs can emerge from the data. For example, by observing the punch-card it can be found which pairs of emotions are likely to occur together or are expected to be found. Combined with data about the tasks or actions performed by the user, possible future emotional reactions can be predicted on related tasks. These patterns can be an essential part of the recognition process in the sense that they can show the emotion(s) that a user can be expected to display at a particular time period and when performing a specific action, taking into account the previous emotion distributions. Furthermore, with respect to the task/event that was carried out by the user during an experiment, emotions that cause a large variation in the emotional state of the person can be interpreted by the system and visualised as how the person reacts to the occasion. Moreover, if a combination of emotions is detected for a distinct experiment date, the data is shown as a gradient of the two emotions. This is done to distinguish (as is also achieved by the bar chart) the cases in which the method used produces poorly correlated results. Therefore, if the recognition process produces a dual emotion class in which the two combined emotions are not sufficiently related, to a certain degree, it could be interpreted as a poor choice of recognition methods. The full version of the emotion changes across time visualisation is available at: http://83.212.117.19/PunchCard/.
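The punch-card aggregation could be sketched as follows: emotions are collected per (date, time-interval) cell, and a cell holding more than one emotion corresponds to the 'combination' class described above. The record layout is an assumption for illustration.

```javascript
// Build punch-card cells: for each experiment date (y-axis) and time
// interval (x-axis), collect the emotions recognised there. A cell
// with two emotions corresponds to the 'combination' class. The
// record field names are assumptions.
function punchCard(records, intervalSeconds = 2) {
  const cells = new Map();
  for (const { date, time, emotion } of records) {
    const interval = Math.floor(time / intervalSeconds) * intervalSeconds;
    const key = `${date}|${interval}`;
    if (!cells.has(key)) cells.set(key, []);
    cells.get(key).push(emotion);
  }
  return cells;
}
```

Rendering then amounts to drawing one mark per cell, coloured by the single emotion, or by a gradient of the pair when the cell holds a combination.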

Facial Expressions Grouped by Emotion
The second visualisation consists of a column diagram (similar to the one used for head pose changes in Figure 6) which illustrates the aggregation of all facial expressions grouped by the resulting emotions for every two seconds. Figure 10 displays only two emotions, happiness and anger. However, the rest of the recognised emotions can be made visible by clicking the corresponding check-box. The four different emotions are represented by four different colours. On the one hand, the x-axis represents the time scale, which is divided every two seconds until the end of the test. On the other hand, the y-axis displays the number of recognised emotions for all the users that take part in the tests. Furthermore, when hovering above a column, the corresponding emotion summary is displayed. The full version of the visualisation concerning facial expressions grouped by emotion is available at: http://83.212.117.19/FacialExpression3D/.

Emotions Grouped by Time Intervals
The final depiction method chosen for understanding emotion changes is the class connections circle. Each emotion detected is represented as a point with a distinct color at the bottom quarter of the shape, while each experiment time is categorised similarly to the previous classes in the other visualisations. To show the recognised emotion during a period of an experiment, a line is drawn between the point that holds the time class in interest and the emotion assigned to it. The final graph produced by the process depicts the association between the durations and the emotions that were recognised at the time of the experiments.
This visualisation was implemented as an additional method to the emotion punch-card, as the correlation between emotions could not be fully interpreted in that case. By combining these two illustration techniques, a user can have a wider view not only of the emotions recognised at specific time intervals, but also of how the emotions and durations of the experiments are correlated. The most essential piece of information that can be portrayed in this graph is how different emotion classes coexist with time classes. This allows the user to understand whether a particular incident occurring at a specific time (considering the fact that the data comes from identical experiments performed at each date) would have a positive or negative effect on the psychology of the person and the way that this will transpire. For example, taking the 06.00 to 08.00 class, it is clear that the two emotions recorded are related to negative/unpleasant emotions, since the user was identified to be angry and surprised. Through the constant observation of the sentiments and the order in which they have been recorded in different durations, a general estimation of human reactions can be drawn based on the event that the person in the example was exposed to. In addition, the lines connecting the two points can also be viewed as the links between the durations and the emotions, and if the intensity is required to be shown as well, the line's colour may be based on a variable to be determined in relation to the experiment's intensity.
Figure 10. Column visualisation of two detected emotions, 'happiness' and 'anger'.
The full version of the emotions grouped by time intervals visualisation is available at: http://83.212.117.19/IntensityCircle/.

Conclusions and Future Work
To advance the field of human intention understanding, we need mechanisms capable of externalising the facts and enabling people to manipulate the findings of human activity monitoring tasks at a higher level. In this work, we propose seven different web-based visualisations that can help tame the scale and complexity of the depth data collected for the purpose of monitoring head pose and emotion changes. All visualisations and other data, as well as the source code, are publicly available online. Future work includes going beyond basic player monitoring to study whether the actions taken by an educator result in further changes in the mood of the players, in the context of serious games. Another direction would be to analyse whether the aforementioned mood changes would produce different results in the performance of the players.