Selection of the Optimal Classification Model
For each of the neural network types mentioned in Section 2.4, the following architectures were tested:
CNN networks containing from 2 to 3 convolution layers (each convolution layer was followed by a max pooling layer) followed by 1 to 2 dense layers, with 50 to 400 neurons per convolution layer and 50 to 200 neurons per dense layer;
RNN networks containing from 2 to 4 layers, each built from 50 to 400 neurons;
RNN-LSTM networks containing from 2 to 4 layers, each built from 50 to 400 neurons.
For all NN types, separate models were built, increasing the neuron count of each layer by 25 for each new model (e.g., for RNN, starting with a network containing 2 layers of 50 recurrent neurons each and finishing with 4 layers of 400 neurons each).
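As a minimal sketch of how such an architecture grid can be enumerated for the RNN variant, the snippet below follows the layer counts and 25-neuron step described above. The Keras API and the name build_rnn are our own choices for exposition; the paper does not specify the framework.

```python
import tensorflow as tf

def build_rnn(layer_sizes, n_features, n_classes=7):
    """Build a simple RNN classifier; one SimpleRNN layer per entry in layer_sizes."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(None, n_features)))  # variable-length sequences
    for i, units in enumerate(layer_sizes):
        # All recurrent layers except the last must return full sequences.
        model.add(tf.keras.layers.SimpleRNN(units, return_sequences=(i < len(layer_sizes) - 1)))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))  # 7-class output
    return model

# Enumerate RNN candidates: 2-4 layers, 50-400 neurons per layer in steps of 25.
candidates = [
    [units] * depth
    for depth in range(2, 5)
    for units in range(50, 401, 25)
]
```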
Table 3 shows the results obtained using the three types of neural network for the above-mentioned datasets. For CNN, the best results were obtained for a network of 4 layers: 3 convolution layers of 250, 250, and 100 neurons, respectively, and a dense layer of 100 neurons. For RNN, the best results were obtained for a 3-layer model with recurrent layers of 300, 150, and 100 neurons. RNN-LSTM achieved its best results for a 3-layer architecture of 250, 300, and 300 neurons. In addition, all NNs had a single dense layer of 7 neurons as the output layer. We used 10-fold leave-one-subject-out cross-validation, repeated the process for 10 iterations, and averaged the final score. All NNs were trained using ADAM [46] for gradient descent optimisation and cross-entropy as the cost function, as it is a robust method based on well-known classical probabilistic and statistical principles and is therefore problem-independent. It is capable of efficiently solving difficult deterministic and stochastic optimisation problems [47]. Training was set to 500 epochs with an early stopping condition if no loss decrease was detected for more than 30 epochs.
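As a concrete example, a minimal Keras sketch of the best-performing RNN-LSTM configuration (3 LSTM layers of 250, 300, and 300 neurons, a 7-neuron softmax output, ADAM, cross-entropy, and early stopping with a patience of 30 epochs) could look as follows; the framework choice, the variable names, and the example feature count are our assumptions.

```python
import tensorflow as tf

def build_best_lstm(n_features, n_classes=7):
    # 3 stacked LSTM layers of 250, 300, 300 neurons, as selected above.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, n_features)),
        tf.keras.layers.LSTM(250, return_sequences=True),
        tf.keras.layers.LSTM(300, return_sequences=True),
        tf.keras.layers.LSTM(300),
        tf.keras.layers.Dense(n_classes, activation="softmax"),  # 7-class output layer
    ])

model = build_best_lstm(n_features=75)  # e.g., 25 Kinect v2 joints x 3 position coordinates
model.compile(optimizer="adam",                        # ADAM gradient descent optimisation
              loss="sparse_categorical_crossentropy",  # cross-entropy cost function
              metrics=["accuracy"])

# Stop training once no loss decrease is seen for more than 30 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="loss", patience=30, restore_best_weights=True)

# model.fit(x_train, y_train, epochs=500, callbacks=[early_stop])
```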
One can easily observe that the best results (69%) were obtained using RNN-LSTM on the P set, containing the positions of all skeletal joints (upper and lower body). In general, this feature set gives the best results for all types of networks (58.1% for CNN, 59.4% for RNN). This suggests that, of all the considered feature types, these features provide the best description of emotional expressions. In the case of the PU set, results for all networks are lower by about 5%, which indicates the effect of the lower part of the body on recognition. Using orientation O as a feature set, even when complementing the position (PO or POU), results in much lower recognition. The results in Table 3 also indicate a slight impact of the error rate: better results were achieved using the 3 cm error rate for almost every dataset and NN; in a few cases the results were equal. This may suggest that even a small movement or displacement can affect the recognition of emotions and that the 5 cm error rate might not be precise enough to represent all relevant movement data.
In addition, the experiment was conducted on sequences without the keyframing step in the pre-processing (containing all the recorded frames) for all NN models and all the datasets. The classification results were 5–10% lower (depending on the model and set) than those acquired with key frames at an error rate of 3 cm. Moreover, the NN training time rose significantly due to the large increase in data volume. The lower recognition results for sets without keyframing might have been caused by Kinect v2 sensor noise, as the device output is not very precise and produces small variations in the returned positions and orientations from frame to frame. This can be mitigated by applying filtering to the signal; however, that is a time- and computation-consuming process, which conflicts with our assumption of reducing data pre-processing to a minimum. In our approach, the keyframing process allowed us to avoid these sensor precision issues.
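Purely as an illustration of the general idea behind threshold-based keyframing (not the authors' exact algorithm, which is defined in the pre-processing section), a greedy selection that keeps a frame only once some joint has moved beyond the error threshold since the last key frame could be sketched as follows; select_keyframes is our own name.

```python
import numpy as np

def select_keyframes(frames, error_threshold=0.03):
    """Greedy keyframe selection sketch (illustrative, not the paper's exact method).

    frames: array of shape (n_frames, n_joints, 3) with joint positions in metres.
    A frame becomes a key frame when any joint has moved more than
    error_threshold (e.g., 0.03 m = 3 cm) since the last key frame.
    """
    keyframes = [0]
    for i in range(1, len(frames)):
        displacement = np.linalg.norm(frames[i] - frames[keyframes[-1]], axis=-1)
        if displacement.max() > error_threshold:
            keyframes.append(i)
    return frames[keyframes], keyframes
```

Under such a scheme, frame-to-frame jitter smaller than the threshold never produces a new key frame, which is consistent with the observation above that keyframing sidesteps the sensor noise.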
The performance of the proposed NN models was compared with a state-of-the-art NN architecture, ResNet. It has won several classification competitions, achieving promising results on tasks related to detection, localisation and segmentation [48]. The core idea of this model is to use a so-called identity shortcut connection to jump over one or more layers [49]. ResNets use convolutional layers with 3 × 3 filters, which are followed by batch normalisation and a rectified linear unit (ReLU). Numerous experiments have shown that the use of shortcut connections makes such models more accurate and faster to train than their plain equivalents. We recreated the exact process described in Reference [48], as the results obtained there for action recognition look very promising (accuracy over 99%), and we initially assumed the method might be applicable to emotional gesture classification. The 3D coordinates of the Kinect skeleton (from our P and PU datasets) were transformed into RGB images. The sets were also augmented according to the description in the source paper. For our experiment, we prepared the testing and training sets following the 10-fold leave-one-subject-out cross-validation method, meaning that the testing set contained neither the training samples nor samples obtained by augmenting them. The accuracy achieved using ResNet is significantly lower than that of the other NN types. This might be caused by the size of the original dataset, which contains only 474 unique samples; the augmentation process presented in Reference [48] does not produce a diverse enough set to train such a deep NN.
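As a sketch of how such a skeleton-to-image transformation is commonly done (our reading of the approach in Reference [48], not its verbatim implementation), each coordinate axis is min-max normalised to [0, 255] and the x, y, z values are treated as the R, G, B channels, with joints as image rows and frames as columns:

```python
import numpy as np

def skeleton_to_rgb(frames):
    """Encode a skeletal sequence as an RGB image (sketch of the idea in Reference [48]).

    frames: array of shape (n_frames, n_joints, 3).
    Output: uint8 image of shape (n_joints, n_frames, 3) where the x, y, z
    coordinates are min-max normalised to [0, 255] and used as R, G, B.
    """
    coords = frames.transpose(1, 0, 2).astype(np.float64)  # joints as rows, frames as columns
    lo = coords.min(axis=(0, 1), keepdims=True)            # per-axis minimum
    hi = coords.max(axis=(0, 1), keepdims=True)            # per-axis maximum
    normalised = (coords - lo) / (hi - lo + 1e-9)          # guard against a zero range
    return (normalised * 255).astype(np.uint8)
```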
For each type of neural network, the best results are presented in the form of a confusion matrix (see Figure 7). One can observe that the best results were obtained for the neutral state, as it differs greatly from the other expressions (the actor/actress stood still, whereas expressing the other states involved considerably more movement).
Happiness, sadness and anger have a high recognition rate and are only sporadically classified as other emotions, as gestures in those three states are highly distinctive and differ from the other emotional states (in terms of dynamics, body and limb positions, and movement), even when the gestures are not exaggerated. Disgust and fear were confused with one another most frequently; this might be caused by the way they were performed by the actor/actress, as this confusion pattern is analogous for all three NN types. It is clearly visible in the recordings that those two emotions were acted out very similarly in terms of gestures (usually a backing-away movement with the hands placed near the head or neck for both states).
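For completeness, a row-normalised confusion matrix of the kind shown in Figure 7 can be computed as below; the scikit-learn call is our choice, and the label order is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Seven classes as discussed above; the ordering here is assumed for illustration.
LABELS = ["neutral", "happiness", "sadness", "anger", "surprise", "disgust", "fear"]

def normalised_confusion(y_true, y_pred):
    """Row-normalised confusion matrix: entry (i, j) is the fraction of
    class-i samples classified as class j. Labels are integer-encoded 0..6."""
    cm = confusion_matrix(y_true, y_pred, labels=range(len(LABELS)))
    return cm / cm.sum(axis=1, keepdims=True)  # assumes every class occurs in y_true
```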
Since the recognition accuracy of the neutral class far exceeds that of the other emotional states (its samples contain the least motion and differ greatly from all the others), in the next step we analyse two sets without this class. From the first one we merely exclude the neutral state, so it consists of sadness, surprise, fear, anger, disgust and happiness. The second set contains the emotional states most commonly used in the literature: sadness, fear, anger, and happiness. Experimental results for the above-mentioned datasets are presented in Table 4. As in the case of seven classes, the best results were obtained using the P set. Similarly, RNN-LSTM proved to be the most effective, achieving 72% in the case of 6 classes and 82.7% in the case of 4. Confusion matrices for the above-mentioned sets are presented in Figure 8 and Figure 9.
In order to compare the proposed method with other classification methods, we calculated the most commonly used features, such as kinematics-related features (velocity, acceleration, kinetic energy), spatial-extent-related features (bounding box volume, contraction index, density), smoothness-related features, leaning-related features and distance-related features. During feature extraction we strictly followed the approach presented in Reference [23], since the authors obtained very promising results on a database derived from Kinect recordings. We compared several well-known classification methods to verify the above-mentioned features and their effectiveness in gesture-based emotion recognition. The obtained results are presented in Table 5.
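To make the feature families concrete, the sketch below gives simplified versions of a few of them (per-joint velocity and acceleration, unit-mass kinetic energy, and bounding box volume). These are our own illustrative formulations, not the exact definitions of Reference [23].

```python
import numpy as np

def kinematic_and_spatial_features(frames, dt=1 / 30):
    """Illustrative hand-crafted features (cf. Reference [23], simplified).

    frames: array of shape (n_frames, n_joints, 3); dt is the frame interval.
    Returns a small dictionary of per-recording statistics.
    """
    velocity = np.diff(frames, axis=0) / dt               # per-joint velocity vectors
    acceleration = np.diff(velocity, axis=0) / dt         # per-joint acceleration vectors
    speed = np.linalg.norm(velocity, axis=-1)             # scalar joint speeds
    kinetic_energy = 0.5 * (speed ** 2).sum(axis=1)       # unit-mass kinetic energy per frame

    extent = frames.max(axis=1) - frames.min(axis=1)      # per-frame bounding box edges
    bbox_volume = np.prod(extent, axis=1)                 # bounding box volume per frame

    return {
        "mean_speed": speed.mean(),
        "mean_acceleration": np.linalg.norm(acceleration, axis=-1).mean(),
        "mean_kinetic_energy": kinetic_energy.mean(),
        "mean_bbox_volume": bbox_volume.mean(),
    }
```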
To determine the performance of the above-mentioned classifiers we used the WEKA [50] environment. All parameters of the classifiers were set empirically in order to achieve the highest efficiency. As one can easily observe, the best results were obtained in the case of Random Forests. However, it should be emphasised that none of the methods listed above achieves better results than the proposed approach. This is a result of the generalisation of features over the whole recording, an approach which might be appropriate for simple gesture recognition; however, it becomes inaccurate for more complex and non-repeatable expressions.
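The experiments above were run in WEKA; for readers working in Python, an analogous Random Forest baseline over such per-recording feature vectors could be set up as follows. The scikit-learn calls, the GroupKFold grouping by performer (to approximate the leave-one-subject-out protocol), and all names are our assumptions, not the paper's setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

def evaluate_random_forest(X, y, subjects):
    """X: per-recording feature vectors, y: emotion labels, subjects: performer ids.

    Grouping folds by subject keeps each performer's samples out of the
    training folds they are tested on, mirroring leave-one-subject-out.
    """
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    cv = GroupKFold(n_splits=10)
    scores = cross_val_score(clf, X, y, cv=cv, groups=subjects)
    return scores.mean()
```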