Capturing Conversational Gestures for Embodied Conversational Agents Using an Optimized Kanade–Lucas–Tomasi Tracker and Denavit–Hartenberg-Based Kinematic Model

In order to recreate viable and human-like conversational responses, an artificial entity, i.e., an embodied conversational agent, must express correlated speech (verbal) and gesture (non-verbal) responses in spoken social interaction. Most of the existing frameworks focus on intent planning and behavior planning. The realization, however, is left to a limited set of static 3D representations of conversational expressions. In addition to functional and semantic synchrony between verbal and non-verbal signals, the final believability of the displayed expression is shaped by the physical realization of non-verbal expressions. A major challenge for most conversational systems capable of reproducing gestures is the diversity in expressiveness. In this paper, we propose a method for capturing gestures automatically from videos and transforming them into 3D representations stored as part of the conversational agent’s repository of motor skills. The main advantage of the proposed method is ensuring the naturalness of the embodied conversational agent’s gestures, which results in a higher quality of human-computer interaction. The method is based on a Kanade–Lucas–Tomasi tracker, a Savitzky–Golay filter, a Denavit–Hartenberg-based kinematic model, and the EVA framework. Furthermore, we designed an objective evaluation method based on cosine similarity, instead of a subjective evaluation of the synthesized movement. The proposed method achieved a 96% similarity.


Introduction
Visual articulation of information through embodied behavior plays an important role in spoken social interaction [1]. Speech and gestures (including hand gestures, facial expressions, posture, and gazing) originate from the same representation, but are not necessarily based solely on the speech production process; i.e., "speech affects what people produce in a gesture, and that gesture, in turn, affects what people produce in speech" [2] (p. 260). In fact, more than 50 percent of visual articulation (i.e., embodied behavior) in spoken interaction adds non-redundant information to the common ground of the conversation [3]. Moreover, over 70% of the social meaning of a conversation or an interaction is transmitted through concepts other than words [4] (p. 27).
Recently, face-to-face interaction has been gaining attention, especially in interfaces where personalization is one of the key drivers, such as eHealth [5] and support for the elderly in their interaction with information technology [6]. Face-to-face interaction has been shown to elicit user engagement and stimulate the use of conversational interfaces [7], where the non-verbal, visual components drive the elicitation of affect and social awareness in human partners [8]. Overall, embodied conversational agents (ECAs) are becoming indispensable tools in personalizing and personifying everyday scenarios, where non-verbal behavior plays a crucial role in both representations of information and its understanding [9].
The paper is structured as follows. Section "Related Work" provides an overview of related work on gesture tracking and reconstruction techniques. Section 2 outlines the formalism used to implement the proposed tracking and reconstruction system. Section 3 shows the results of the system evaluated objectively on an embodied conversational agent EVA [21]. Section 4 provides the discussion, and the conclusions follow.

Related Work
Synthetic responses are often perceived as lacking in consistency, diversity, and vividness. Gestures perceived as "random", "non-aligned", or "no-sense" distort the perception of eloquence, competence, human-likeness, and vividness of conversational responses. Creating believable movements is challenging, since the movements must be meaningful and natural, reflecting the coupling between gestures and speech [22]. In some cases, it has been suggested that having no gestures will achieve better persuasion and perceived believability than inadequate gesturing [23]. The main drawback most systems face in the context of believability is how to cope with diversity, i.e., how to create a repository of non-verbal expressions large enough to adequately represent both intent and thought [21].
Much of the complexity associated with the reconstruction process is related to the "posing" phase, in which the animators must handle a large number of on-screen handles associated with the character's virtual skeleton. These handles allow direct or indirect modeling of the available degrees of freedom (DOFs) of all the individual character's joints [24]. Moreover, to generate believable animation using CAD environments, the animators must understand and tackle many modeling and animation techniques (e.g., polygonal modeling, modeling with NURBS or subdivision, UV skinning, forward and inverse kinematics, rigging, etc.). The resulting animations are realistic in a given context. However, they may decrease believability significantly when the internal fluidity (dynamics) is changed due to different intent or prosodic context. To increase the versatility of the gestural morphology and to decrease the complexity of designing the movements a conversational agent can reproduce, we propose to build a 3D corpus of gesture prototypes [20]. For the modeling environment, we exploited the DAZ3D studio, which simplifies the CAD controls, and, through a vast repertoire of conversational resources, minimizes the human animator's need for 'artistic' skills. Each prototype is expressively adjustable, and the EVA realizer [21] can mitigate the general issues of fluidity. However, since prototypes are generated manually by observing 'real-life' conversational expressions, the inner fluidity and the trajectory are not captured, but rather defined artificially through a set of orientational in-between points used to model the forward kinematics. As a result, the increasing complexity of gestures decreases the observed fluidity and perceived believability significantly.
To capture and preserve a high resolution of inner fluidity, performance-driven animation can be exploited to create 3D resources [25]. In this concept, an actor's physical performance is transferred interactively to the virtual character to be animated. The method requires specialized equipment (e.g., a sensor suit) and specially trained experts, making it highly expensive and less suitable for non-professional animators. The mapping between the performer's and character's motion is also a complex task, since both entities operate in different spaces. Thus, the process requires sophisticated configuration steps and automatic retargeting [26]. In [27], the authors outline a sophisticated system consisting of multiple cameras and passive sensors to compensate for the lack of naturalness and capture movement generated during the conversation. Most multi-view methods utilize multiple cameras and exploit the image depth and shape from silhouette cues to capture the moving actor [28], or reconstruct gestures via a multi-view photometric stereo approach [29]. These methods typically require a high-resolution scan of the person as an input.
With advances in deep learning (DL) and image processing, and the availability of depth camera sensors, new opportunities arise that could enable end-to-end reconstruction and capture of 3D resources. Methods integrating Kinect or similar depth sensors [30][31][32] or multi-view data [33,34] achieve impressive reconstructions, but do not register all frames to the same canonical template, and require complicated capture setups. Moreover, to represent conversational movement viably, the captured resources must originate from real-life situations integrating spontaneous behavior [35]. Recreation, even when performed by professional actors, will always reflect a degree of artificiality, resulting in less spontaneous and less diverse responses [36].
If we want to create an inventory of gestures large enough to enable the generation of natural gestures in interaction, and to generate such an inventory time-efficiently, it makes sense to exploit the multitude of existing video recordings, and, consequently, to use a method based on video material recorded with a single camera. Most conversational corpora consist of TV interviews and theatrical plays, which have shown themselves to be an appropriate resource of spontaneous conversational expressions, and are significantly more suitable for research in wider 'discourse concepts' than any artificially recorded material [37]. Most methods for 3D pose and shape estimation from monocular sources rely on (deep) convolutional neural networks ((D)CNNs), leveraging 2D joint tracking to predict 3D joint poses in the form of stick figures [38][39][40][41][42]. The major challenge with deep learning and similar probabilistic approaches is that the tracking process involves predicting the most probable configuration of the artificial skeleton. Thus, the captured conversational movement will approximate something known to the model rather than be an exact replication of what is observed. Overall, the DNN-based approaches work well within the constraints of a known context (i.e., a fixed environment and known classes). However, under uncertainty, the models tend to underperform and require retraining [43]. The inconsistency and uncertainty of deep models (e.g., PoseNet, OpenPose) in many cases result in issues such as incoherence in fluidity (e.g., sudden shifts) and over-smoothing of the actual movement [44], leading to a decrease in believability when replicated as part of synthetic conversational behavior.
With our main motivation in mind, i.e., to capture conversational expressions from monocular video as similar to the original as possible, and by preserving the 'prosody of movement', i.e., fluidity and dynamics, we designed a novel system, which consists of a Kanade-Lucas-Tomasi tracker (KLT) [45] to track the observed body parts based on optical flow, and a Denavit-Hartenberg-based kinematic model [46] to reconstruct tracked features as 3D templates and store them as part of the EVA's [20] motor skills repository. However, as with any image processing algorithm, mismatches in either tracking or reconstruction will always appear. Thus, in addition to the non-predictive method, it is crucial to have an objective measure to evaluate this effect. Instead of subjective evaluation of believability through perceptive experiments, as generally utilized in the field of embodied conversational agents, we propose a new method that allows for easy assessment of the mismatch generated in the tracking and reconstruction process.

Materials
The material used in our work is FullHD video with varying frame rates. To test our system, conversational gestures were first recorded with a video camera in a laboratory environment, with a relatively plain background and a single actor. The resolution of the laboratory videos was FullHD (1920 × 1080), with H.264 compression and a frame rate of 30 FPS. In addition to the laboratory videos, video clips from a video podcast were selected; their content included spontaneously created conversational gestures and a diverse conversation between two performing actors. We deliberately used video content with a large number of spontaneously generated gestures, which usually consists of conversational footage. It should be noted that certain video content with professional actors (talk shows, evening news, etc.) does not offer a large amount of spontaneous and/or naturally created gestures, since professional actors produce conversational gestures that are acted out rather than created spontaneously. We subjectively selected video content with spontaneously created gestures as our experimental example. The podcast videos were streamed from a social network, where they were published for the purpose of sharing video content. The resolution of the obtained video was also FullHD (1920 × 1080), but with a different frame rate (25 FPS) than the laboratory video sources. Between 10 and 19 conversational gestures per movement type were analyzed to evaluate the conversational gesture reconstruction of the EVAPose system. The analysis was performed for both laboratory and spontaneously generated conversational gestures. The spontaneous gestures were captured from the video podcast Gospoda [47].

Laboratory Set-Up
The core idea of the proposed method is to capture conversational expressions from different human collocutors engaged in interaction contained in the EVA-Corpus Video dataset, and store them back as "motor-skills" in the EVA-Corpus MotorSkills dataset, i.e., 3D artefacts to be re-used by embodied conversational agents during human-machine interaction. The workflow with individual steps is outlined in Figure 1.

Figure 1. Workflow to capture conversational behavior in spontaneous discourse automatically, store it as gesture templates, and interconnect the captured templates with other verbal and nonverbal features of the observed sequence. The hand shape is not tracked; a CNN model was used to select the shape from a dictionary of possible shapes based on the HamNoSys notation system [48].
The input to the proposed method was a color video stream, a conversational sequence contained in the EVA Corpus Video dataset. In the preparation phase, the best-fitting tracking points are selected automatically. The tracked points were filtered to reduce noise and possible inaccuracy in tracking, visualized as 'jitter', or sudden and instant jumps of the observed object from one position to another. The tracked geometry is sent to the Denavit-Hartenberg-based kinematic model and transformed (Reconstruction in Figure 1) into Euler angles (yaw, pitch, and roll), stored as a procedural animation (Create Resource in Figure 1). To validate the captured conversational expression (including the articulated shapes and inner fluidity) and compare it against the original, the expression was synthesized on our proprietary ECA realizer [21] by its in-scene recorder (Animation and Capture in Figure 1) functionality. If the synthesized expression is recognized as similar (similarity index above 70%), the realization is registered as a possible visualization of the conversational concept. The following sections highlight the individual steps in more detail.
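As a rough illustration of such an objective check, the similarity between an original and a synthesized joint-angle trajectory can be computed as a cosine similarity. The following is a minimal NumPy sketch; the function name, the flattening of trajectories into vectors, and the test signals are our assumptions, not the exact EVA implementation:

```python
import numpy as np

def trajectory_similarity(a, b):
    """Cosine similarity between two joint-angle trajectories.

    Each trajectory is an array of shape (frames, joints); both are
    flattened into vectors before comparison, so 1.0 means identical
    trajectories and values near 0 mean unrelated movement.
    """
    a, b = np.ravel(np.asarray(a, float)), np.ravel(np.asarray(b, float))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Original captured trajectory vs. a slightly perturbed synthesized one
t = np.linspace(0.0, 2.0 * np.pi, 100)
original = np.column_stack([np.sin(t), np.cos(t)])
synthesized = original + 0.05 * np.sin(5.0 * t)[:, None]  # small reconstruction error
sim = trajectory_similarity(original, synthesized)
print(sim > 0.7)  # above the 70% registration threshold
```

In this form, a value of 1.0 indicates identical trajectories, and any realization above the 70% threshold would be registered as a valid visualization of the conversational concept.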

Preparation Step
Tracking arbitrary objects consistently and accurately in video sequences is challenging. Selecting robust features that best correspond to physical points and can, at the same time, be tracked well (e.g., mitigating occlusions, disocclusions, and features that do not correspond to points in the world) is the first step in delivering an effective tracker. The Shi-Tomasi implementation [45,49] represents a robust method to select "good features" and can, at the same time, recover if and when these features are lost due to occlusion or "loss of visibility", a common occurrence when tracking the movement of hands in multiparty discourse. Unlike the Harris corner detector, the Shi-Tomasi implementation proposes a variation in the selection of corners, considering a pixel as a corner by comparing the eigenvalues directly:

R = \min(\lambda_1, \lambda_2) > \lambda,

where \lambda_1 and \lambda_2 are the two eigenvalues of the symmetric gradient (structure) matrix and \lambda is the predefined threshold. The pixel is considered a corner when both \lambda_1 and \lambda_2 are above the threshold. Figure 2 highlights the selection of the N strongest corners as defined by Shi-Tomasi (a) and selected tracking points (b), to be tracked and used as input in the reconstruction.
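To make the criterion concrete, the min-eigenvalue score can be sketched in a few lines of NumPy. This is a simplified illustration of the criterion, not the implementation used in the system; the synthetic test image and window size are our choices:

```python
import numpy as np

def shi_tomasi_score(img, win=2):
    """Min-eigenvalue ("good features to track") corner score per pixel."""
    Iy, Ix = np.gradient(img)          # row-wise (y) and column-wise (x) derivatives
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    H, W = img.shape
    score = np.zeros_like(img)
    for r in range(win, H - win):
        for c in range(win, W - win):
            # Structure matrix summed over the (2*win+1)^2 window
            sxx = Ixx[r - win:r + win + 1, c - win:c + win + 1].sum()
            syy = Iyy[r - win:r + win + 1, c - win:c + win + 1].sum()
            sxy = Ixy[r - win:r + win + 1, c - win:c + win + 1].sum()
            t = sxx + syy
            d = np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)
            score[r, c] = 0.5 * (t - d)  # the smaller eigenvalue min(l1, l2)
    return score

img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0                # white square: four corners, long edges
s = shi_tomasi_score(img)
print(s[10, 10] > s[10, 20])           # corner scores higher than edge
```

On this synthetic white square, the score is high at the square's corners and vanishes along its edges and in flat regions, which is precisely the behavior the Shi-Tomasi selection relies on.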

Figure 2. Example of (a) good features as extracted by the Shi-Tomasi detector and (b) tracking points as the strongest corners in a specific region, representing the "tracked" joints in the human skeleton.
As outlined in Figure 2, the algorithm operates over grayscale images. The "quality" of corners (i.e., the λ threshold) is specified as a value between 0 and 1, and all the corners below the threshold are rejected. Since we wanted to track only specific artefacts representing the shoulder, elbow, and wrist "joints", the user selects the regions of interest. Based on the manual selection of a region of interest, the algorithm selects the strongest corner automatically (i.e., the "green" circles in Figure 2b) as the final tracking point, and "rejects" all nearby corners of interest. In the tracking process, the tracking points are regarded as features.


KLT Feature Tracker
The KLT feature tracker [50] computes the displacement of features between consecutive frames by aligning a second image J to an input image I, where I(x, y) represents the intensity of the image at the point [x y]^T. Let

u = [u_x u_y]^T

represent the point at coordinates (x, y) in the first image I. The goal of tracking is to find the point v in the second image J for which the displacement d is minimal, i.e., I(u) and J(v) are as similar as possible. The displacement d = [d_x d_y]^T represents the image velocity (optical flow) at u. The minimal difference is computed as the mean squared error function

\varepsilon(d) = \varepsilon(d_x, d_y) = \sum_{x=u_x-w_x}^{u_x+w_x} \sum_{y=u_y-w_y}^{u_y+w_y} (I(x, y) - J(x + d_x, y + d_y))^2,

where w_x, w_y represent the integration window size parameters of the template window of size (2w_x + 1) \times (2w_y + 1).
The intensity of the image is represented by a small template window of n × n, centered at one of the feature points. J(x) is the same window in the next frame. d represents the displacement vector, and η represents the error introduced due to the shape change.
During the tracking process, the goal is to find the point v in image J that corresponds to the point u in image I:

v = u + d,

where the image difference is defined as

\delta I(x, y) = I(x, y) - J(x, y).

In standard optical flow computation, the goal is to find v_{opt}, the displacement v which minimizes the matching function \varepsilon(v):

\varepsilon(v) = \varepsilon(v_x, v_y) = \sum_{x=u_x-w_x}^{u_x+w_x} \sum_{y=u_y-w_y}^{u_y+w_y} (I(x, y) - J(x + v_x, y + v_y))^2,

where the optimum v is calculated through a first-order Taylor expansion of J around the current estimate. Because of the first-order Taylor approximation, this is only valid when the pixel displacement is small. Thus, the standard optical flow computation is performed in k = 1, ..., K steps, defined by

v^k = v^{k-1} + \eta^k, \quad v^0 = [0\ 0]^T, \quad v = v^K = \sum_{k=1}^{K} \eta^k,

where v represents the final optical flow, \eta^k the displacement computed in step k, and K the number of iterations needed to reach convergence. The new pixel displacement (i.e., one step in the LK optical flow computation) \eta^k is defined as

\eta^k = G^{-1} b_k, \quad G = \sum_{x,y} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \quad b_k = \sum_{x,y} \delta I_k(x, y) \begin{bmatrix} I_x \\ I_y \end{bmatrix},

where \delta I_k(x, y) = I(x, y) - J(x + v_x^{k-1}, y + v_y^{k-1}) is the image mismatch at step k. The derivatives I_x, I_y are computed at the beginning, and only \delta I_k is recomputed at each step k. The overall iteration completes when \eta^k is smaller than the threshold, or the maximum number of iterations is reached.
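The iteration can be demonstrated on a synthetic scene. The following single-window sketch uses a Gaussian test image, an illustrative tracked point, and a known sub-pixel shift (all our assumptions), and omits the image pyramid:

```python
import numpy as np

# Synthetic scene: I is a Gaussian blob, J is the same blob shifted by d_true
d_true = np.array([0.6, -0.4])                    # (dx, dy) ground-truth flow
f = lambda x, y: np.exp(-((x - 20.0) ** 2 + (y - 20.0) ** 2) / 50.0)
I = lambda x, y: f(x, y)
J = lambda x, y: f(x - d_true[0], y - d_true[1])  # scene moved by d_true

u = np.array([18.0, 17.0])                        # tracked point (off the peak)
w = 2                                             # 5x5 integration window
ox, oy = np.meshgrid(np.arange(-w, w + 1), np.arange(-w, w + 1))
X, Y = u[0] + ox, u[1] + oy

# Spatial gradients of I (central differences), computed once
Ix = (I(X + 1, Y) - I(X - 1, Y)) / 2.0
Iy = (I(X, Y + 1) - I(X, Y - 1)) / 2.0
G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
              [np.sum(Ix * Iy), np.sum(Iy * Iy)]])

v = np.zeros(2)
for k in range(20):                               # iterative refinement
    dI = I(X, Y) - J(X + v[0], Y + v[1])          # image mismatch at current guess
    b = np.array([np.sum(dI * Ix), np.sum(dI * Iy)])
    eta = np.linalg.solve(G, b)                   # one LK step
    v += eta
    if np.hypot(*eta) < 1e-6:                     # convergence threshold
        break
print(v)
```

Starting from v = 0, the loop converges toward the ground-truth flow (0.6, -0.4), since the image mismatch vanishes at the true displacement; a pyramidal scheme would be needed for larger motions.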
The implementation of the KLT tracker used in our research is highlighted in Figure 3. The preparation step, definition of feature points and tracking points according to the defined process, is described in Section 2.1. Feature points were selected according to Shi-Tomasi's approach, and the tracking points were set to regions representing the "shoulder", "elbow", and "wrist" joints. We used a 5 × 5 points integration window.
The tracking process implements the iterative optical flow computation and matches tracking features between the current image_i (J) and the previous image_{i−1} (I) by tracking feature points, where i is on the interval [1, n] and n is the last frame of the video recording of the conversational sequence. If the point v in image_i that corresponds to the point u in image_{i−1} cannot be found (i.e., the feature falls outside of the image), and/or the image patch around the tracking point varies too much (i.e., the cost function is larger than a threshold), the feature point is regarded as lost and is deleted. To recreate the expression, all tracking points must be registered in all frames, and the missing feature points must be replaced. The process exploits the Shi-Tomasi detector to create a new set of good features to replace lost features. The tracking points are relocated automatically to the closest new best feature point. The new feature points and 2D coordinates of each tracking point are saved, and the algorithm may proceed with the next frame.
The tracking process completes when the last frame of the video is reached.

Filtering
The designed tracker implements "tracking-by-detection" and does not implement feature descriptors, such as SIFT [51]. This means that the tracking accuracy varies depending on rotation, scale, and image perspective distortions (including lighting changes). The inaccuracy results in a "long-distance" move of a tracking point instead of a small shift in position (i.e., noise). When the movement is reconstructed on the ECA, these errors are observed as instant "jumps", movements which cannot be expected in real life. To avoid this, we implemented a digital filter based on the Savitzky-Golay algorithm [52]. The Savitzky-Golay filter belongs to the family of FIR filters, and computes the smoothed signal (or estimates of its derivatives) by convolution with coefficient sets derived from least-squares polynomial fitting. Savitzky-Golay filters minimize the least-squares error in fitting a polynomial to a sequence of noisy data. Consequently, the precision of the data increases without distorting the signal tendency. Thus, the method is suitable for signal smoothing [53,54].
Let us consider the captured tracking points as a composite of the captured movement, i.e., the main signal f(l), corrupted randomly by distortions w(l); thus, the real signal (the stream of tracking points) is defined as

x(l) = f(l) + w(l), \quad l = 0, \ldots, L - 1,

where x(l) indicates the l-th tracking point (i.e., the l-th frame) in the signal with L points (the sequence of data). The goal is to smooth x(l) to reduce the level of the remaining noise to as low as possible, i.e., to minimize the following MSE:

MSE = \frac{1}{L} \sum_{l=0}^{L-1} (f(l) - y(l))^2,

where y(l) is the filter output and the smoothing is carried out with a symmetric window of width N = 2M + 1 samples around the "reconstruction point". In this case, the smoothing can be represented as fitting a polynomial of order n,

p(i) = \sum_{k=0}^{n} a_k i^k, \quad i = -M, \ldots, M,

where a_k is the k-th coefficient of the polynomial.
The filter output is then equal to the value of the polynomial at the central point of the window, i.e., y(0) = p(0) = a_0. To calculate the next point, the window N is shifted by 1 unit. Savitzky and Golay [52] showed that the above process of 'filtering' is equivalent to convolving the samples in the window with a fixed impulse response h(i):

y(l) = \sum_{i=-M}^{M} h(i) x(l + i).

To select the SG parameters used for smoothing optimally, we applied power spectrum analysis. A power spectrum analysis was performed on an arbitrarily selected area of the unfiltered input signal, showing the level of the signal power spectrum and the noise. The same analysis was then calculated for the input signal filtered with the chosen SG parameters. In the area of the high-frequency harmonics, we checked at what distance from the densification of the signal spectrum there was still a noticeable change in the spectrum of the individual filtered signal at the noise level. It was found that, with a window width of N = 9 and a polynomial of degree 3, we achieved a sufficient smoothing effect. Figure 4 shows the power spectrum of the smoothed signal; the smoothed signal with a third-degree polynomial and a window width of 9 is highlighted. It can be seen that the level of the power spectrum of the smoothed signal is less intense at slow transitions of the input signal than at faster ones. From this, we can conclude that the parameters selected preserve the signal's slower transitions, while the smoothing increases in the areas with faster transitions (the parts of the signal with higher frequency content). Figure 5 shows the smoothing results using an SG filter with a window width N = 9 and a polynomial of the 3rd order. For the objective weight function w, a cubic function was used.
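The equivalence between polynomial fitting and convolution with a fixed impulse response can be sketched directly in NumPy. The edge-padding boundary handling and the test signal below are simplifying assumptions, not the exact treatment used in our system:

```python
import numpy as np

def savgol_impulse_response(window=9, order=3):
    """Fixed impulse response h whose convolution with x gives y(0) = p(0) = a_0."""
    M = window // 2
    i = np.arange(-M, M + 1)
    A = np.vander(i, order + 1, increasing=True)   # columns i^0, i^1, ..., i^n
    return np.linalg.pinv(A)[0]                    # least-squares row selecting a_0

def savgol_smooth(x, window=9, order=3):
    h = savgol_impulse_response(window, order)
    M = window // 2
    xp = np.pad(x, M, mode="edge")                 # simple boundary handling (assumption)
    return np.convolve(xp, h[::-1], mode="valid")

rng = np.random.default_rng(0)
l = np.linspace(0.0, 4.0 * np.pi, 200)
clean = np.sin(l)                                  # underlying movement f(l)
noisy = clean + 0.2 * rng.standard_normal(l.size)  # x(l) = f(l) + w(l)
smooth = savgol_smooth(noisy, window=9, order=3)   # N = 9, 3rd-order polynomial
print(np.mean((smooth - clean) ** 2) < np.mean((noisy - clean) ** 2))
```

With N = 9 and a third-order polynomial, as selected above, the smoothed signal has a lower mean squared error against the clean signal than the noisy input, while the coefficients sum to one, so slow transitions are preserved.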


Denavit-Hartenberg Based Reconstruction
As highlighted in Figure 1, the tracking points are sent to the reconstruction phase. In the first step, internal angles are calculated from the tracking points using basic angle functions, based on the player's position in the video scene. The internal angle is calculated between the starting point o_a (in our case, the player's shoulder in the video signal) and the end of the player's arm (the end-effector) o_{ef}. In the case when the player is facing us, the internal angle is calculated as q_8 = arctan2(o_a, o_{ef}). For each joint that rotates, we can write q_i = arctan2(o_a, o_i). The signal tracking points for each joint are marked as o_i.
In this phase, 2D coordinates of tracked points are converted into Euler angles, which can be animated by our conversational agent. The arm is deconstructed into a manipulator consisting of three spherical joints: the shoulder, the elbow, and the wrist joint. Figure 6 outlines the designed manipulator. The end-effector is placed at the far end of the arm (e.g., the tip of the hand) as a reference point used in the automatic kinematic analysis algorithm.
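A minimal sketch of such in-plane angle extraction from three tracked 2D points follows. The function and point layout are illustrative assumptions; the full system maps the resulting angles onto the kinematic chain of the manipulator:

```python
import numpy as np

def arm_angles_2d(shoulder, elbow, wrist):
    """In-plane joint angles from three 2D tracking points (x, y).

    q1 is the rotation of the upper arm in the image plane; q2 is the
    elbow rotation measured relative to the upper-arm direction.
    """
    upper = np.subtract(elbow, shoulder)   # shoulder -> elbow segment
    fore = np.subtract(wrist, elbow)       # elbow -> wrist segment
    q1 = np.arctan2(upper[1], upper[0])
    q2 = np.arctan2(fore[1], fore[0]) - q1
    return q1, q2

# Arm held horizontally with the forearm bent straight up
q1, q2 = arm_angles_2d((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
print(np.degrees(q1), np.degrees(q2))
```

For this configuration the shoulder angle is 0 degrees and the elbow angle 90 degrees, matching the intuitive reading of the pose.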
Figure 6. Visualization of the kinematic model, i.e., the arm manipulator consisting of complex cylindrical joints, implementing the degrees of freedom of a spherical joint utilized in the skeleton of the realization entity.

The goal is to determine the rotation of each joint in the mechanism of human arm movement from the positions of the tracking points. As outlined in Figure 6, the proposed kinematic model represents each spherical joint by multiple revolute joints, each permitting rotation about a single axis. Using a single degree of freedom allows us to represent each angle of rotation of the spherical joint with a single real number, and the rotation in the spherical joint as a composition of single-axis rotations. This allows us to determine the position and, more importantly, the orientation of the tracking points in a systematic way, where the cumulative effect (A_i) is calculated using the Denavit-Hartenberg (D-H) convention [46].
We assumed the proposed model consisted of eight revolute joints and ten links, and that joint i connects link i−1 with link i. Thus, when joint i was actuated, link i and the further links in the kinematic chain of the robot arm moved. To perform the kinematic analysis, a coordinate frame was attached to each link, i.e., x_i y_i z_i to link i, represented by the tracking point. Using the D-H convention, we assumed A_i was a homogeneous transformation matrix expressing the position and orientation of x_i y_i z_i with respect to x_{i−1} y_{i−1} z_{i−1}. A_i varies as the configuration of the manipulator changes; however, since we assumed the use of revolute joints, A_i is a function of a single joint variable and can be represented as the product of four basic transformations, rotations and translations around the z_{i−1} and x_i axes:

A_i = Rot(z_{i−1}, θ_i) · Trans(z_{i−1}, d_i) · Trans(x_i, a_i) · Rot(x_i, α_i) (15)

where θ_i, d_i, a_i, and α_i are the parameters associated with link i and joint i, denoted as joint angle, link offset, link length, and link twist, respectively. These parameters are shown in Figure 6. Since revolute joints were used, A_i^{i−1} is a function of the single variable θ_i, while the other three parameters, denoted as D-H parameters, are constant for a given link. The D-H parameters for each link of the proposed kinematic model are shown in Table 1.
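As a concrete illustration of the D-H convention, the product of the four basic transformations can be written out directly. The sketch below (a minimal NumPy illustration under the standard D-H parameterization, not the authors' implementation; the chain parameters are made up) builds A_i from the four parameters and composes a forward-kinematics chain:

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """A_i = Rot(z, theta) @ Trans(z, d) @ Trans(x, a) @ Rot(x, alpha):
    the standard D-H homogeneous transform for one revolute joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# Forward kinematics: compose A_i along the chain (illustrative parameters,
# (theta, d, a, alpha) per joint)
chain = [(0.1, 0.0, 0.30, np.pi / 2), (0.2, 0.30, 0.0, 0.0)]
T = np.eye(4)
for theta, d, a, alpha in chain:
    T = T @ dh_matrix(theta, d, a, alpha)
```

The composed matrix T then holds the end frame's rotation in its upper-left 3×3 block and its position in the last column.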
The link lengths a_1, a_5, −l_x, −l_y and the link offsets d_4, l_y were calculated as the average values of measurements performed over multiple human arms:

(a_n, d_n, l_n) = ( (1/n_p) Σ_1^{n_p} a_n , (1/n_p) Σ_1^{n_p} d_n , (1/n_p) Σ_1^{n_p} l_n ) (16)

where n_p represents the number of candidates participating in the measurements (n_p = 10) and the variable l_n represents the end-effector position. For our case, we calculated the D-H parameters as a_1 = 30 cm, a_5 = 30 cm, d_4 = 30 cm, l_x = 1.5 cm, l_y = 2.0 cm, and l_z = 12 cm. Using the D-H parameters in Table 1, we calculated the homogeneous transformation matrices A_i^{i−1} for each joint and created a reference transformation matrix for the forward kinematics of the proposed kinematic model:

T_9^0 = A_1^0 A_2^1 · · · A_9^8

where the last two matrices, A_8^7 and A_9^8, are the matrices of the end-effector. However, it is not trivial to represent an arbitrary homogeneous transformation using only four parameters [46,55]. Given two consecutive coordinate frames, x_0 y_0 z_0 and x_1 y_1 z_1, we assumed there exists a homogeneous transformation matrix A_1^0. Moreover, we assumed the axis x_1 was perpendicular to z_0 and that x_1 intersects the z_0 axis. Under these conditions, there exist unique numbers a, d, θ, α, and Equation (15) can also be written as

A = [ [R, p], [0, 1] ]

where R describes the rotation matrix of the joints and p describes the displacement. The rotation in the homogeneous transformation matrix can be written as a composition of single-axis rotations through the angles α, β, and γ, which represent the Euler rotations named roll, pitch, and yaw, respectively.
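The roll, pitch, and yaw angles can be recovered from the rotation block R with arctan2. The sketch below assumes the common R = R_z(yaw) · R_y(pitch) · R_x(roll) composition (an assumption, since the paper's expanded matrix is not reproduced here) and a non-degenerate pitch:

```python
import numpy as np

def rpy_from_matrix(R):
    """Recover (roll, pitch, yaw) from a rotation matrix assumed to be
    R = Rz(yaw) @ Ry(pitch) @ Rx(roll); requires |pitch| != pi/2."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return roll, pitch, yaw
```

At pitch = ±90 degrees the decomposition degenerates (gimbal lock), and roll and yaw are no longer uniquely separable.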
Since the embodied conversational agent EVA [21] has a default position different from our kinematic model, the calculated data must be adjusted to be suitable for the EVA-Script Template, using a 90-degree rotation of the α angle (roll). This adjustment represents a method to normalize scene results to any given space of any embodied conversational agent with an underlying three-joint-based skeletal structure. The calculated data can now be used for 3D conversational resources, as defined by the final component, i.e., the Resource Generator.

Resource Generator
In this step, the captured and reconstructed stream of Euler angles is transformed into a procedural animation, i.e., an EVA-Script Template compatible with the repository of "motor skills". In the EVA Framework, a conversational expression or gesture is defined as a function of conversational intent and its realization through visualization (i.e., a movement model) [16], where Ĝ represents the gesture's contextual interpretation based on the conversational intent, and Ĥ is the movement model used to "visualize" the conversational intent. The movement model is then defined as a transition between the pose at the beginning (P_S) and the pose at the end (P_E), via a trajectory (J), carried out over time t.
Since the realization engine utilizes procedural animation and forward kinematics, the trajectory J is specified as a sequence of in-between frames, i.e., J = {P_{S+1}, . . . , P_{E−1}}. The attribute t is used to "optimize" the number of in-between frames given the time constraint t and the realizer's targeted frame rate f:

N = ⌊ n_J / (t × f) ⌋ + 1

Here, N represents the number of in-between frames in J to skip, and n_J represents the total number of frames captured by J. We added 1 frame to preserve the number of in-between frames, since the first transformation, from configuration P_{S−1} to P_S (i.e., the start pose), is captured by J. The product t × f represents the maximum number of frames the realizer can implement without any impact on the fluidity; namely, adding too many in-between frames will "slow down" the gesture synthesis. The remaining series of configurations is afterwards transformed into an EVA template [22]. The overall process is outlined by the following pseudocode. The algorithm always takes the first and the last frame as the start and end poses, respectively, and every i-th in-between frame such that i mod N equals zero; other frames are dropped. The function toUnit maps the 3D definition of each frame into EVAScript's unit notation, i.e., a {key, list} pair, where key represents the articulated joint (e.g., collar, shoulder, elbow, forearm, or wrist) and list represents the sequence of 3D configurations for the joint (i.e., the sequence of yaw, pitch, and roll configurations) that constitute the 3D transformation from pose i to pose j. The function createGesture then defines Ĝ as a procedural fragment written in EVAScript markup, as shown in Figure 7.
    H ← empty list
    append(toUnit(frame[0]), H)
    for i ← 1 to size(frame) − 2 do
        if i mod N = 0 then
            append(toUnit(frame[i]), H)
    append(toUnit(frame[size(frame) − 1]), H)
    createGesture(H, t)

Figure 7. An example of a procedural animation formulated in EVAScript markup. Each <sequence><parallel> pair represents the configurations P_i, P_{i+1} as the transition between two consecutive frames, adjusted to the frame-rate scaling. The attribute durationUp represents the duration of the transition, calculated as sizeof(H)/t, and the value represents the 3D configuration of the "joint" (movement controller) in Euler angles, expressed in roll-pitch-yaw (HPR) notation.
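The frame-selection step can also be sketched in runnable form. The computation of N as ⌊n_J/(t·f)⌋ + 1 is reconstructed from the description above and is an assumption, as are the function names:

```python
import math

def select_frames(frames, t, f):
    """Keep the start pose, every N-th in-between frame, and the end pose,
    so the realizer stays within its budget of roughly t * f frames."""
    n_j = len(frames)
    # N reconstructed from the description: skip factor, +1 to stay in budget
    n = math.floor(n_j / (t * f)) + 1
    kept = [frames[0]]
    kept += [frames[i] for i in range(1, n_j - 1) if i % n == 0]
    kept.append(frames[-1])
    return kept
```

For example, with 20 captured frames, t = 1 s, and f = 5 FPS, N = 5, and frames 0, 5, 10, 15, and 19 are kept.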

Results
In this section, we present the evaluation of the reconstructed motion: the aspects under which the reconstructed movement was evaluated, the evaluation procedures, and the characteristics used in those procedures. The main objective is to create natural movement from sources whose content is naturally generated conversation. The existing corpora of reconstructed conversational gestures suffer from a relatively small number of artificially generated gestures. The use of social networks and their video content sharing allows us to obtain resources containing naturally generated conversational gestures. Although we can create a large number of reconstructed conversational gestures in this way, these gestures have to be evaluated both objectively and subjectively. In our evaluation, we focused on objective evaluation, which allows us to remove poorly reconstructed gestures before evaluating them subjectively.
As already mentioned, the assessment of the similarity between the actual and the reconstructed movement in most cases depends on the context of use. In some cases, only the rough course of the movement may be of interest, while, in others, an accurate reconstruction of the movement of each joint is necessary. Usually, a 3D reconstructed movement is considered similar to the real one if the only difference is a global geometric transformation, such as a translation or rotation; the speed of the conversational gestures can also be taken into account [56]. In the case of conversational gestures, such an aspect is not satisfactory, since logical and numerical similarities are missing. Aspects taking logical and numerical similarities into account define logically similar movements as several versions of the same action or sequence of actions. In most cases, the algorithms used to evaluate logical and numerical similarities are based on quantitative characteristics [57]. However, it is necessary to consider that a logically similar movement can be numerically different, and vice versa. In many contexts of use, partial similarities are important, where the movement of some parts of the body is reconstructed differently from others. In this aspect, it is necessary to be aware that including the extracted characteristics of "unimportant" parts of the body in the similarity measure can impact the result negatively [58].
In our case, the similarity between the input signal, tracked using the Kanade-Lucas-Tomasi tracker, and the reconstructed signal, tracked with the same tracker, was calculated using local similarity assessment features. This approach aimed to address the partial, logical, and numerical aspects of the similarity between reconstructed and actual human movement. Cosine distance features were used to compare the two signals over the entire time axis (marked as sim1). At the same time, the similarity of both signals was also assessed at selected time points, representing the moment of the conversational gesture position with the greatest similarity (sim2). An important feature that must be considered in our case is the number of displayed images per second (frame rate), measured in FPS. With this feature, we added a new dimension to the similarity assessment, which captured the aspect of the global transformation over time to a certain extent. By taking the time feature into account, we can evaluate the reconstructed gestures quantitatively and directly. We performed an experimental evaluation for two methods: OpenPose and our proposed EVAPose. Because the OpenPose method failed to provide results for the original frame rate of the recording, we defined the condition sim2, which enabled us to compare the results of the methods at a similar frame rate. To this end, we configured the proposed EVAPose method to generate the same frame rate as the OpenPose method. A similarity assessment was made between the input video signal and the reconstructed signal for both the proposed EVAPose and the OpenPose systems. The similarity score marked sim1 denotes the average of the scores of the individual gestures, while sim2 denotes the maximum of the average scores of the individual gestures. Both ratings were computed for both systems. Similarity scores for the cases used in the evaluation are given in Table 2.
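The per-gesture cosine score can be illustrated with a short sketch. The signal layout (one angle trajectory per gesture) and the function names are assumptions for illustration, not the paper's evaluation code:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two equally long angle trajectories."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim1(original, reconstructed):
    """Average cosine similarity over all gestures (whole time axis)."""
    return float(np.mean([cosine_similarity(o, r)
                          for o, r in zip(original, reconstructed)]))
```

A score of 1.0 indicates identically oriented trajectories; values near 0 indicate uncorrelated ones.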
The results of the reconstructed motion from video clips created in the laboratory are shown in the upper part of Table 2. The content of these videos was created to test the system and does not show naturally generated conversational gestures. The lower part of Table 2 shows the results for conversational gestures gesticulated in a spontaneous and natural way. Table 2 lists one of the tested sources with such content: in most cases, a varied conversation between two actors with different, spontaneous gestures. The language used by the two actors was Slovenian. The results show how the proposed system compared with the OpenPose system [42]. Conversational gestures containing the most movement in the vertical and horizontal directions were selected for comparison. We considered these types of conversational gestures as the geometric basis for more complex gestures, which consist of combinations of horizontal and vertical movements.
To assess naturalness, we monitored events that occurred during the reconstructed gesture. We noticed that, compared to the EVAPose system, the signal from the OpenPose system did not have a smooth, continuous transition. At some moments, an unnatural movement occurred, triggered by a sudden jump in the angle of a single joint. To detect such jumps, we differentiated the reconstructed signal (position as a function of time) three times. The third derivative revealed the jerks (jumps) and their number. Figure 8 shows the third derivative of the reconstructed signal as a function of the number of samples.
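Counting such jumps can be approximated with third-order finite differences on the sampled joint-angle signal; a minimal sketch, where the threshold and names are illustrative rather than taken from the paper:

```python
import numpy as np

def count_jerks(positions, threshold):
    """Approximate the third derivative of a joint-angle signal with the
    third finite difference and count samples exceeding the threshold."""
    jerk = np.diff(np.asarray(positions, dtype=float), n=3)
    return int(np.sum(np.abs(jerk) > threshold))
```

A smooth quadratic trajectory has a zero third difference everywhere, so it produces no jerks, while injecting a sudden jump into a single sample produces a nonzero count.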


Discussion
It can be seen from Table 2 that, in absolute terms, the similarity score of the OpenPose system was better than the score of the proposed system. The highest frame rate achieved by the OpenPose system was 11 FPS. To achieve comparability, we adjusted the frame rate of the proposed system by increasing the degree of the polynomial in the smoothing process (described in the previous sections). When we reached a frame rate of 19 FPS, the similarity no longer improved, but the computational complexity of the proposed method increased tremendously. The same gesture was reconstructed in all videos containing multiple vertical and horizontal movements. Reducing the frame rate in the EVAPose system was shown to improve the similarity. In the case of horizontal movement from the video content created in a laboratory environment, the score exceeded that of the OpenPose system. The laboratory video content, with a length of 46 s, which was provided as input to the proposed system, contained 10 gestures, all of which were reconstructed successfully.
The results show that, despite the lower average score, gestures were reconstructed better (in the sense of naturalness) with the proposed system than with the OpenPose system. The better reconstruction in terms of naturalness manifested mainly in the jumps of the reconstructed signal. It can be seen from Figure 8 that the EVAPose system produced a smaller number of jerks, albeit with a larger amplitude. It is also important that the EVAPose jerks were not aligned with the OpenPose jerks, which can undoubtedly be attributed to the fact that the jerks are not the result of naturally generated jumps in the conversational gestures, but are a statistical error of the algorithms.
The results show that, for the real case from the Gospoda dataset, the proposed system did not reach the similarity level achieved by the OpenPose system; however, the results were comparable. On the other hand, the EVAPose system can maintain the original frame rate. In the case with an adjusted frame rate, video content with a lower frame rate was chosen only to test the EVAPose system at different frame rates. It is worth emphasizing that the preliminary laboratory experiments with subjective similarity evaluation showed that the objective assessment could reach the level of the subjective assessment in all cases.
Despite the better evaluation score, the reconstruction with the OpenPose system did not provide a sufficiently large degree of naturalness. Even though the gesture was captured from video content in which a spontaneously created conversational gesture appeared, the OpenPose system introduced errors in the reconstruction of the rotation of individual joints, producing rotations a human cannot perform. We concluded that the cause of such errors lies in the machine learning process: unlike our system, the OpenPose system does not have clearly defined ranges of rotation for the individual joints of the human body. In our case, we defined the rotation ranges of the individual joints using the Denavit-Hartenberg notation before calculating the forward kinematics, thus ensuring that unnatural joint rotations did not occur. The choice of forward kinematics and the notation of the kinematic model according to Denavit-Hartenberg also allowed a continuous transition of a specific movement.
As already stated, the assessment of similarity is relative in nature, because the systems can be used in different ways. In some applications, the similarity assessment is important, while the computational complexity of the gesture reconstruction and the duration of the reconstruction process do not matter. In other cases, we want to preserve the frame rate and can settle for a lower similarity score of the speech-gesture reconstruction. In this case, we must be aware that the frame rate is preserved at the expense of lower similarity, which is conditioned by the acceptability of the gestures generated in this way, and by the perception of their naturalness and the other important characteristics of successful spoken social interaction.