The application of AR requires a large amount of data, collected from a diverse group of participants. Researchers have published datasets to enable the validation and comparison of results. These datasets can consist of posed, induced, and/or natural emotions and can be grouped based on content, data modality, and/or participants [3]. Such datasets are composed of posed or spontaneous facial expressions, primary expressions or facial action units as labels, still images or video sequences (i.e., static/dynamic data), and controlled laboratory or uncontrolled non-laboratory environments.
2.2. State of the Art of RECOLA for Affect Recognition
In this section, we present relevant work from the literature on the prediction of arousal and valence values from physiological, visual, and multiple sensor sources, with a particular focus on the RECOLA dataset.
Table 1 further reports the results from the literature discussed in this section. In our previous work [13], we used physiological data (EDA and ECG recordings and their features) from the RECOLA dataset to predict the arousal and valence emotional measures. The EDA and ECG signals were processed and labelled with arousal or valence annotations, and a series of regressors was tested to predict arousal and valence values. The optimizable ensemble regressor achieved the best root mean squared error (RMSE), Pearson correlation coefficient (PCC), and concordance correlation coefficient (CCC). The baseline results achieved by the individual models for the gold-standard emotion sub-challenge (GES) of the 2018 audio/visual emotion challenge (AVEC) [14] are reported in terms of CCC. The physiological results of [14] were obtained through an emotion recognition system based on support vector machines (SVMs), used as static regressors. For visual data, the authors of [14] achieved the best CCC on arousal predictions using a multitask formulation of the Lasso algorithm, and the best results on valence predictions using an SVM. Hierarchical fusion over the different data modalities was then applied via Lasso and multitask Lasso to improve the predictions of arousal and valence values.
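For concreteness, the three metrics reported throughout Table 1 can be computed as in the following minimal NumPy sketch; the function names are ours, not taken from any of the cited toolkits:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient: higher is better."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def ccc(y_true, y_pred):
    """Concordance correlation coefficient: higher is better.

    Unlike PCC, CCC also penalizes systematic bias in mean and scale,
    which is why the AVEC challenges adopted it for dimensional affect.
    """
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```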
Amirian et al. [10] used random forests along with various fusion schemes to predict arousal and valence values from RECOLA's audio, video, and physiological data. Their best results were obtained by a combination of random forests and linear regression fusion over all modalities (audio, visual, and physiological). End2You [15] is a multimodal profiling toolkit developed by Imperial College London to predict continuous dimensional emotion labels of arousal and valence. It uses raw audio, visual information (i.e., video), and physiological ECG signals as input. The authors of [15] predicted arousal and valence on RECOLA's ECG signal and video recordings. Brady et al. [16] used RECOLA's physiological data and baseline features, as specified in AVEC 2016 [11], to apply regression over arousal and valence values via a long short-term memory (LSTM) recurrent neural network (RNN). They also extracted higher-level features from raw video and audio features using deep supervised and unsupervised learning, based on sparse coding, to ease the learning of the baseline SVM regressor. They used convolutional neural network (CNN) features to predict arousal and valence values from video recordings, using an RNN. Finally, they proposed predicting continuous emotion dimensions using a state-space approach such as Kalman filters, where measurements and noise are modelled as Gaussian distributions, to fuse the affective states (i.e., predictions) from the audio, video, and physiological data. According to [12], the results obtained by the authors of [16] are the best reported on RECOLA in the literature; as such, we compare our results to theirs.
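To illustrate the state-space idea behind such fusion, the following is a minimal sketch of a one-dimensional Kalman filter that treats per-modality predictions as noisy measurements of a latent affect state. The random-walk state model and the per-modality noise variances are simplifying assumptions of ours, not the exact formulation of [16]:

```python
import numpy as np

def kalman_fuse(predictions, meas_vars, q=1e-3):
    """Fuse per-modality arousal/valence predictions with a 1-D Kalman filter.

    predictions: (T, M) array, one column of noisy predictions per modality.
    meas_vars:   (M,) measurement-noise variance assumed per modality
                 (e.g., estimated from each model's validation error).
    q:           process-noise variance of the latent affect state.
    """
    T, M = predictions.shape
    x, p = 0.0, 1.0           # state estimate and its variance
    fused = np.empty(T)
    for t in range(T):
        p += q                # predict: random-walk model for the affect state
        for m in range(M):    # update: sequentially assimilate each modality
            k = p / (p + meas_vars[m])         # Kalman gain
            x += k * (predictions[t, m] - x)   # correct with the residual
            p *= (1.0 - k)
        fused[t] = x
    return fused
```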
Han et al. [17] used RECOLA's visual features to predict arousal and valence values through an RNN. Weber et al. [18] used the visual features provided by RECOLA's team in 2016 to perform regression via an SVM with late, subject-level multimodal fusion (at the decision/prediction level). The authors of [19] exploited CNN features from RECOLA's videos, along with an RNN, to estimate valence values. CNNs have also shown promising results when used to perform AR. AlexNet [7] was used in a number of studies to obtain deep visual features and has been applied to emotion recognition, where results demonstrated evident performance enhancements [8,9]. In this work, we exploit CNNs, such as ResNet and MobileNet, to predict continuous dimensional emotion annotations in terms of arousal and valence values from visual data.
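As an illustration of this setup, the following PyTorch sketch attaches a two-output regression head (arousal, valence) to a pretrained MobileNet-v2 backbone. The Tanh output range, the MSE loss, and the hyperparameters are placeholder assumptions, not the exact configuration tuned in this study:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained MobileNet-v2 backbone with a 2-output regression head.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Linear(model.last_channel, 2),  # [arousal, valence]
    nn.Tanh(),                         # assumed annotation range [-1, 1]
)

criterion = nn.MSELoss()               # a CCC-based loss is another option
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

frames = torch.randn(8, 3, 224, 224)   # dummy batch of face crops
targets = torch.rand(8, 2) * 2 - 1     # dummy arousal/valence labels
optimizer.zero_grad()
loss = criterion(model(frames), targets)
loss.backward()
optimizer.step()
```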
Povolny et al. [20] presented a multimodal emotion detection algorithm using audio, bottleneck, and text-based features, as well as the features suggested by Valstar et al. [11]. The set of visual features in [11] was complemented with CNN features, extracted from hidden layers after training the CNN for landmark localization. The authors of [20] proposed multiple linear regression systems, trained on individual feature sets, for predicting the arousal and valence emotional dimensions. In comparison to [20], Somandepalli et al. [21] used Kalman filters for decision-level fusion. They first used support vector regression (SVR) to obtain predictions from unimodal features, where the predictions are noisy estimates of arousal and valence. The output of the SVR models was then fed to the Kalman filters for fusion. They later [22] proposed facial posture cues and a voicing probability scheme to deal with the multimodal nature of the problem.
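The following scikit-learn sketch illustrates this two-stage pattern: unimodal SVR predictions followed by a learned fusion over the stacked outputs. The synthetic data and hyperparameters are placeholders, and we use linear regression for the fusion stage where [21] used Kalman filters:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

# Dummy per-modality feature matrices and a shared arousal target,
# standing in for RECOLA's audio/visual/physiological feature sets.
rng = np.random.default_rng(0)
X_audio, X_video, X_physio = (rng.normal(size=(500, 20)) for _ in range(3))
y = rng.uniform(-1, 1, size=500)

# Step 1: one unimodal regressor per feature set.
unimodal = [SVR(kernel="rbf").fit(X, y) for X in (X_audio, X_video, X_physio)]

# Step 2: decision-level fusion over the stacked unimodal predictions
# (in practice, fit on a held-out set to avoid overfitting the weights).
stacked = np.column_stack(
    [m.predict(X) for m, X in zip(unimodal, (X_audio, X_video, X_physio))]
)
fusion = LinearRegression().fit(stacked, y)
fused_pred = fusion.predict(stacked)
```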
Table 1 summarizes the results of the above-mentioned state-of-the-art studies and compares them to our results, which are presented in the remainder of the paper. Lower RMSE values and higher PCC and CCC values indicate better performance.
Table 1.
Summary of results from the literature on prediction of arousal and valence values.
| Data Type | Prediction | Reference | Technique | Results (RMSE, PCC, CCC) |
|---|---|---|---|---|
| Physiological | Arousal | Current | ECG + EDA Optimizable Ensemble | 0.0168, 0.9965, 0.9959 |
| Physiological | Arousal | [13] | ECG + EDA Optimizable Ensemble | 0.0154, 0.9976, 0.9967 |
| Physiological | Arousal | [10] | ECG Random Forests | N/A, N/A, 0.097 |
| Physiological | Arousal | [10] | EDA Random Forests | N/A, N/A, 0.074 |
| Physiological | Arousal | [14] | ECG SVM | N/A, N/A, 0.065 |
| Physiological | Arousal | [14] | EDA SVM | N/A, N/A, 0.029 |
| Physiological | Arousal | [15] | ECG End2You | N/A, N/A, 0.154 |
| Physiological | Arousal | [16] | ECG RNN | 0.218, 0.407, 0.357 |
| Physiological | Arousal | [16] | EDA RNN | 0.250, 0.089, 0.082 |
| Physiological | Valence | Current | ECG + EDA Optimizable Ensemble | 0.0083, 0.9985, 0.9978 |
| Physiological | Valence | [13] | ECG + EDA Optimizable Ensemble | 0.0139, 0.9954, 0.9946 |
| Physiological | Valence | [10] | ECG Random Forests | N/A, N/A, 0.139 |
| Physiological | Valence | [10] | EDA Random Forests | N/A, N/A, 0.206 |
| Physiological | Valence | [14] | ECG SVM | N/A, N/A, 0.043 |
| Physiological | Valence | [14] | EDA SVM | N/A, N/A, 0.058 |
| Physiological | Valence | [15] | ECG End2You | N/A, N/A, 0.052 |
| Physiological | Valence | [16] | ECG RNN | 0.117, 0.412, 0.364 |
| Physiological | Valence | [16] | EDA RNN | 0.124, 0.267, 0.177 |
| Visual | Arousal | Current | MobileNet-v2 CNN | 0.1220, 0.7838, 0.7770 |
| Visual | Arousal | [10] | Random Forests | N/A, N/A, 0.514 |
| Visual | Arousal | [14] | Multitask Lasso | N/A, N/A, 0.312 |
| Visual | Arousal | [15] | End2You | N/A, N/A, 0.358 |
| Visual | Arousal | [16] | CNN + RNN | 0.201, 0.415, 0.346 |
| Visual | Arousal | [17] | RNN | N/A, N/A, 0.413 |
| Visual | Arousal | [18] | SVM + Subject Fusion | N/A, N/A, 0.682 |
| Visual | Valence | Current | MobileNet-v2 CNN | 0.0823, 0.7789, 0.7715 |
| Visual | Valence | [10] | Random Forests | N/A, N/A, 0.498 |
| Visual | Valence | [14] | SVM | N/A, N/A, 0.438 |
| Visual | Valence | [15] | End2You | N/A, N/A, 0.561 |
| Visual | Valence | [16] | CNN + RNN | 0.107, 0.549, 0.511 |
| Visual | Valence | [17] | RNN | N/A, N/A, 0.527 |
| Visual | Valence | [18] | SVM + Subject Fusion | N/A, N/A, 0.468 |
| Visual | Valence | [19] | CNN + RNN | 0.107, 0.554, 0.507 |
| Multimodal | Arousal | Current | Optimizable Ensemble + MobileNet-v2 | 0.0640, 0.9435, 0.9363 |
| Multimodal | Arousal | [10] | Random Forests + Linear Regression | 0.118, 0.776, 0.762 |
| Multimodal | Arousal | [14] | Hierarchical Fusion + Lasso | N/A, N/A, 0.657 |
| Multimodal | Arousal | [16] | Kalman Filters | 0.115, 0.774, 0.770 |
| Multimodal | Arousal | [20] | Multiple Linear Regressors | N/A, N/A, 0.833 |
| Multimodal | Arousal | [22] | SVR + Kalman Filters | N/A, N/A, 0.703 |
| Multimodal | Valence | Current | Optimizable Ensemble + MobileNet-v2 | 0.0431, 0.9454, 0.9364 |
| Multimodal | Valence | [10] | Random Forests + Linear Regression | 0.104, 0.634, 0.624 |
| Multimodal | Valence | [14] | Hierarchical Fusion + Multitask Lasso | N/A, N/A, 0.515 |
| Multimodal | Valence | [16] | Kalman Filters | 0.100, 0.689, 0.687 |
| Multimodal | Valence | [20] | Multiple Linear Regressors | N/A, N/A, 0.596 |
| Multimodal | Valence | [22] | SVR + Kalman Filters | N/A, N/A, 0.681 |
The following references showed the potential of using RNNs and CNNs to perform AR. Gunes and Schuller [23] trained two separate deep CNNs, pre-trained on a large dataset and then fine-tuned on a dataset of audio and video. Gunes et al. [24] showed the potential of using LSTMs for dimensional emotion prediction. Ringeval et al. [25] used an LSTM RNN to perform regression for dimensional emotion recognition based on visual, audio, and physiological modalities. Chen et al. [26] used LSTMs to identify the long-term inter-dependency within segments of a multimedia signal and proposed a new conditional attention fusion scheme, in which modalities are weighted according to their current and previous features. Tzirakis et al. [27] used a shallow network followed by identity mapping to extract features from raw audio and video signals. The obtained features were then fed into a two-layer LSTM, which was trained end-to-end instead of training its individual components separately. This approach outperformed traditional approaches based on baseline handcrafted features on the RECOLA dataset. Huang et al. [28] applied a deep neural network and hypergraphs for emotion recognition using facial features; the features were extracted from the last fully connected layer of the trained CNN and then treated as attributes for the hypergraph. Ebrahimi et al. [29] used a CNN to extract features that were input into an RNN, which categorizes the emotions in RECOLA's video recordings.
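A minimal PyTorch sketch of the recurrent sequence-to-sequence regression shared by these approaches follows; the class name, layer sizes, and the 88-dimensional input are illustrative placeholders, not any cited architecture:

```python
import torch
import torch.nn as nn

class SeqAffectRegressor(nn.Module):
    """LSTM mapping a sequence of frame-level features to
    per-frame arousal/valence predictions."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # [arousal, valence]

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out)              # (batch, time, 2)

model = SeqAffectRegressor(n_features=88)
x = torch.randn(4, 100, 88)               # 4 sequences of 100 frames
preds = model(x)                           # torch.Size([4, 100, 2])
```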
Most state-of-the-art studies performed complex processing, feature extraction, and multimodal fusion. Despite these efforts, the prediction performance of their models can still be improved. In [13], we performed simple processing and achieved better results using only the EDA and ECG recordings of RECOLA. In this study, we aim to further improve our prediction performance and initiate our work on the video recordings of RECOLA.
2.3. Study Contributions
Our goal in this work is to design and develop a novel adaptable intervention to remediate cognitive impairments in people with schizophrenia using virtual reality (VR), based on synergistic computer science and psychology approaches. A novel machine learning approach that relies on visual and physiological sensory data is required to adaptively adjust the virtual environment to the affective states of users. In the future, we also aim to automatically optimize the level of cognitive effort requested of users, while avoiding discouragement. AR can help determine the affective states of users, and this study presents the first milestone of this project. In our proposed solution, a multi-sensory system will be used to improve the prediction of affective states. The information from the various data sources in the system will be processed to predict the affective state of the user, during VR immersion, through classical and deep machine learning techniques.
Figure 2 displays a high-level diagram of our proposed solution. It is composed of subsystems, one for each data modality, namely the visual (video) and physiological (EDA and ECG) data modalities. We chose to focus on one data modality at a time to perfect the results for each modality first, before combining all modalities in our final system (i.e., multimodal fusion). In [13], we operated on physiological data from RECOLA's EDA and ECG signal recordings. We processed the EDA and ECG signals by applying time delay, early feature fusion, arousal and valence annotation labelling, and data shuffling and splitting. We used early fusion to combine the EDA and ECG modalities at the feature level, as sketched below, and exploited an optimizable ensemble regressor to predict continuous dimensional emotion annotations in terms of arousal and valence values.
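A minimal NumPy sketch of this early fusion and annotation-delay compensation follows; the function name and the delay handling are illustrative simplifications of the processing in [13]:

```python
import numpy as np

def early_fusion(eda_feats, ecg_feats, labels, delay_frames):
    """Feature-level (early) fusion of EDA and ECG with annotation delay.

    eda_feats, ecg_feats: (T, d1) and (T, d2) frame-aligned feature matrices.
    labels:               (T,) arousal or valence annotations.
    delay_frames:         number of frames the annotations lag the signals,
                          compensating for annotators' reaction time.
    """
    X = np.hstack([eda_feats, ecg_feats])    # concatenate at the feature level
    if delay_frames > 0:
        X = X[:-delay_frames]                # drop trailing features...
        labels = labels[delay_frames:]       # ...and leading labels to align
    return X, labels
```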
In this study, we extend our previous work from [13] by (1) adding preprocessing operations; (2) applying feature standardization; (3) applying feature selection; (4) testing additional regressors, namely tree regressors, and exploring RNNs, specifically a bidirectional LSTM (BiLSTM); and (5) introducing decision fusion; a sketch of steps (2)-(4) follows this paragraph. Furthermore, we introduce an additional source of data, namely video recordings. We initially use data from RECOLA as a proof of concept. In the future, we will operate on real data collected in our laboratory.
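The following scikit-learn sketch illustrates how steps (2)-(4) can be chained; the univariate selector, the number of retained features, and the random forest are placeholders standing in for the configuration actually tuned in this study:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

# Standardization -> feature selection -> tree-based regression,
# fitted per target dimension (arousal or valence).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=20)),
    ("regress", RandomForestRegressor(n_estimators=200, random_state=0)),
])
# Usage: pipeline.fit(X_train, y_train); pipeline.predict(X_test)
```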