Emotion Assessment Using Feature Fusion and Decision Fusion Classification Based on Physiological Data: Are We There Yet?

Emotion recognition based on physiological data classification has been a topic of growing interest for more than a decade. However, there is a lack of systematic analysis in the literature regarding the selection of classifiers, sensor modalities, features, and the range of expected accuracy, to name a few limitations. In this work, we evaluate emotion in terms of low/high arousal and valence classification through Supervised Learning (SL), Decision Fusion (DF) and Feature Fusion (FF) techniques using multimodal physiological data, namely Electrocardiography (ECG), Electrodermal Activity (EDA), Respiration (RESP), and Blood Volume Pulse (BVP). The main contribution of our work is a systematic study across five public datasets commonly used in the Emotion Recognition (ER) state-of-the-art, namely: (1) Classification performance analysis of ER benchmarking datasets in the arousal/valence space; (2) Summarising the ranges of the classification accuracy reported across the existing literature; (3) Characterising the results for diverse classifiers, sensor modalities and feature set combinations for ER using accuracy and F1-score; (4) Exploration of an extended feature set for each modality; (5) Systematic analysis of multimodal classification in DF and FF approaches. The experimental results showed that FF is the most competitive technique in terms of classification accuracy and computational complexity. We obtain superior or comparable results to those reported in the state-of-the-art for the selected datasets.


Introduction
Emotion is an integral part of human behaviour, exerting a powerful influence on mechanisms such as perception, attention, decision making and learning. Indeed, what humans tend to notice and memorise are usually not monotonous, commonplace events but the ones that evoke feelings of joy, sorrow, pleasure, or pain [1]. Therefore, understanding emotional states is crucial to understanding human behaviour, cognition and decision making. The computer science field dedicated to the study of emotions is denoted as Affective Computing, whose modern potential applications include, among many others: (1) automated driver assistance, e.g., an alert system that monitors the user and warns of sleepiness, unconsciousness, or unhealthy states that may impair driving; (2) healthcare, e.g., wellness monitoring applications identifying causes of stress, anxiety, depression or chronic disease; (3) adaptive learning, e.g., a teaching application able to adjust the content delivery rate and number of iterations according to the user's enthusiasm and frustration.

State of the Art
In the literature, human emotion processing is generally described using two models. The first decomposes emotion into discrete categories, divided into basic/primary emotions (innate, fast, and arising in response to "fight-or-flight" situations) and complex/secondary emotions (derived from cognitive processes) [3,4]. The second model quantifies emotions along continuous dimensions. A popular instance, proposed by Lang [5], is the two-dimensional Valence (unpleasant-pleasant level) versus Arousal (activation level) model [6], which we adopt in this work. Concerning affect elicitation, it is generally performed through film snippets [6], virtual reality [7], music [8], recall [9], or stressful environments [6], with no commonly established norm on the optimal methodology for ER elicitation.
Many physiological modalities and features have been evaluated for ER, namely Electroencephalography (EEG) [28-30], Electrocardiography (ECG) [31-33], Electrodermal Activity (EDA) [34-36], Respiration (RESP) [26], Blood Volume Pulse (BVP) [26,35] and Temperature (TEMP) [26]. Multimodal approaches have prevailed; however, there is still no clear evidence of which feature combinations and physiological signals are the most relevant. The literature has shown that classification performance improves with the simultaneous exploitation of different signal modalities [2,8,10,37], and that modality fusion can be performed at two main levels: FF [24,38,39] and DF [8,26,37,40,41]. In the former, features are extracted from each modality and later concatenated to form a single feature vector, used as input to the ML model. In DF, on the other hand, a feature vector is extracted from each modality and used to train a separate classifier, whose predictions are combined through a voting system. Hence, with k modalities, k classifiers are created, leading to k predictions that can be combined to yield a final result. Both methodologies are found in the state-of-the-art [42], but it is unclear which is best for ER using multimodal physiological data obtained from non-intrusive wearable technology.
For detailed information on the current state-of-the-art from a more generalised perspective, we refer the reader to the surveys [2,11,43-47] and references therein, where a comprehensive review of the latest work on ER using ML and physiological signals can be found, highlighting the main achievements, challenges, take-home messages, and possible future opportunities.
The present work extends the state-of-the-art of ER through: (1) Classification performance analysis, in the arousal/valence space, of ER for five publicly available datasets that cover multiple elicitation methods; (2) Summarising the ranges of the classification accuracy reported across the existing literature for the evaluated datasets; (3) Characterising the results for diverse classifiers, sensor modalities and feature set combinations for ER using accuracy and F1-score as evaluation metrics (the latter not commonly reported, albeit important to evaluate classification bias); (4) Exploration of an extended feature set for each modality, also analysing feature relevance through feature selection; (5) Systematic analysis of multimodal classification in DF and FF approaches, with superior or comparable results to those reported in the state-of-the-art for the selected datasets.

Methods
To evaluate the classification accuracy in ER from physiological signals, we adopted the two-dimensional Valence/Arousal space. As previously mentioned, the ECG, RESP, EDA, and BVP signals are used, and we compare FF and DF techniques in a feature-space-based framework. In the forthcoming sub-sections, a more detailed description of each approach is presented.

Feature Fusion
As previously mentioned, when working with multimodal approaches, the exploitation of the different signal modalities can be performed using different techniques. We start by testing the FF technique. In FF, the features are independently extracted from each sensor modality (in our case ECG, BVP, EDA, and RESP) and concatenated afterwards to form a single, global feature vector (570 features for EDA, 373 for ECG, 322 for BVP, and 487 for RESP, implemented and detailed in the BioSPPy software library: https://github.com/PIA-Group/BioSPPy). Additionally, we applied Sequential Forward Feature Selection (SFFS) to preserve only the most informative features and to save time and computational power for the machine learning algorithm applied in the next step. All the presented methods were implemented in Python and made available as open-source software at https://github.com/PIA-Group/BioSPPy.
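As a minimal illustration of the FF pipeline, the sketch below concatenates hypothetical per-modality feature matrices (stand-ins for the BioSPPy extractors) and applies scikit-learn's forward sequential feature selection; the modality dimensions, classifier, and number of retained features are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows = 120

# Hypothetical pre-computed feature matrices, one per modality (stand-ins
# for the BioSPPy feature extractors; dimensions reduced for illustration).
features = {
    "ECG":  rng.normal(size=(n_windows, 20)),
    "BVP":  rng.normal(size=(n_windows, 15)),
    "EDA":  rng.normal(size=(n_windows, 25)),
    "RESP": rng.normal(size=(n_windows, 18)),
}
y = rng.integers(0, 2, size=n_windows)  # binary low/high labels

# Feature fusion: concatenate per-modality vectors into one global vector.
X = np.concatenate([features[m] for m in ("ECG", "BVP", "EDA", "RESP")], axis=1)

# Sequential forward feature selection keeps only the most informative
# features before training a single classifier on the fused vector.
selector = SequentialFeatureSelector(SVC(), n_features_to_select=10,
                                     direction="forward", cv=4)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (120, 10)
```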

Decision Fusion
In contrast to FF, in DF a feature vector is extracted from each sensor signal and used independently to train a classifier, so that each modality returns a set of predicted labels. Hence, with k modalities, k classifiers are created, returning k predictions per sample. The predictions are then combined to yield a final result, in our case via a weighted majority voting system. In this voting system, the ensemble decides on the class that receives the highest number of votes across all sensor modalities, with a weight (W) parameter per modality giving the more competent classifiers greater influence on the final decision. The weights were chosen per modality according to the classifier's accuracy on the validation set. In case of a draw in the class prediction, the selection is random.
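The following sketch shows one possible implementation of the described weighted majority voting, assuming hypothetical per-modality predictions and validation-accuracy weights; it is not the exact code used in our experiments.

```python
import numpy as np

def weighted_majority_vote(predictions, weights, rng=None):
    """Combine per-modality predictions via weighted majority voting.

    predictions: dict modality -> predicted labels, shape (n_samples,)
    weights:     dict modality -> scalar weight (e.g., validation accuracy)
    Ties are broken at random, as in the described voting scheme.
    """
    rng = rng or np.random.default_rng()
    modalities = list(predictions)
    labels = np.unique(np.concatenate([predictions[m] for m in modalities]))
    n = len(predictions[modalities[0]])
    fused = np.empty(n, dtype=labels.dtype)
    for i in range(n):
        # Sum the weights of the modalities voting for each candidate label.
        scores = {lab: sum(weights[m] for m in modalities
                           if predictions[m][i] == lab) for lab in labels}
        best = max(scores.values())
        winners = [lab for lab, s in scores.items() if s == best]
        fused[i] = rng.choice(winners)  # random tie-break
    return fused

# Hypothetical per-modality predictions and validation-accuracy weights.
preds = {"ECG": np.array([1, 0, 1]), "EDA": np.array([1, 1, 0]),
         "RESP": np.array([0, 1, 1])}
w = {"ECG": 0.70, "EDA": 0.65, "RESP": 0.55}
print(weighted_majority_vote(preds, w))
```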

Classifier
To perform the classification, seven SL classifiers were tested: K-Nearest Neighbour (k-NN); Decision Tree (DT); Random Forest (RF); Support Vector Machines (SVM); AdaBoost (AB); Gaussian Naive Bayes (GNB); and Quadratic Discriminant Analysis (QDA). For more detail regarding these classifiers, we refer the reader to [48] and references therein.
A comprehensive study of these classifiers' performance and parameter tuning was performed using 4-fold Cross-Validation (CV) to ensure a meaningful validation and avoid overfitting. The value of 4 was selected to balance the number of iterations against the class homogeneity in the training and test sets, since some of the datasets used were highly imbalanced. The best-performing classifier was then chosen using Leave-One-Subject-Out (LOSO) evaluation for incorporation into the FF and DF frameworks.
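A minimal sketch of this two-stage procedure, using scikit-learn with synthetic data and hypothetical subject IDs, could look as follows; the candidate classifiers and parameter grids shown are illustrative, not our full search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)
subjects = np.repeat(np.arange(10), 20)  # hypothetical subject IDs

# Stage 1: parameter tuning with 4-fold CV for a few candidate classifiers.
candidates = {
    "k-NN": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 9]}, cv=4),
    "SVM":  GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=4),
    "RF":   GridSearchCV(RandomForestClassifier(), {"n_estimators": [50, 100]}, cv=4),
}

# Stage 2: evaluate each tuned classifier with Leave-One-Subject-Out folds,
# so no subject appears in both the training and the test set.
logo = LeaveOneGroupOut()
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=logo, groups=subjects)
    print(f"{name}: LOSO accuracy = {scores.mean():.3f}")
```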
To obtain a measurable evaluation of the model performance, the following metrics are computed: Accuracy = (TP+TN)/(TP+TN+FP+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1-score, the harmonic mean of precision and recall [49]. Nomenclature: TP-True Positive; TN-True Negative; FP-False Positive; FN-False Negative.
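For reference, these metrics can be computed directly with scikit-learn; the labels below are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical classifier predictions

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```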

Experimental Results
In this section, we start by introducing the datasets used in this paper, followed by an analysis and classification performance comparison of the FF and DF approaches.

Datasets
In the scope of our work, we used five publicly available datasets for ER, commonly used in previous work for benchmarking:

ITMDER [7]: contains the physiological signals of interest to our work (EDA, RESP, ECG, and BVP) of 18 individuals, recorded using two devices based on the BITalino system [50,51] (one placed on the arm and the other on the chest of the participants) while the subjects watched seven VR videos eliciting the emotions Boredom, Joyfulness, Panic/Fear, Interest, Anger, Sadness, and Relaxation. The ground-truth annotations were obtained from the subjects' self-reports per video using the Self-Assessment Manikin (SAM), in the Valence-Arousal space. For more information regarding the dataset, we refer the reader to [7].

Multimodal Dataset for Wearable Stress and Affect Detection (WESAD) [6]: contains EDA, ECG, BVP, and RESP sensor data collected from 15 participants using a chest- and a wrist-worn device, a RespiBAN Professional (biosignalsplux.com/index.php/respiban-professional) and an Empatica E4 (empatica.com/en-eu/research/e4), under 4 main conditions: Baseline (reading neutral magazines); Amusement (funny video clips); Stress (Trier Social Stress Test (TSST), consisting of public speaking and a mental arithmetic task); and, lastly, Meditation. The annotations were obtained using 4 self-reports: PANAS; SAM in the Valence-Arousal space; State-Trait Anxiety Inventory (STAI); and Short Stress State Questionnaire (SSSQ). For more information regarding the dataset, we refer the reader to [6].

A Dataset for Emotion Analysis Using Physiological Signals (DEAP) [8]: contains EEG and peripheral (EDA, BVP, and RESP) physiological data from 32 participants, recorded as each watched 40 one-minute-long excerpts of music videos. The participants rated each video in terms of the levels of Arousal, Valence, like/dislike, dominance and familiarity. For more information regarding the dataset, we refer the reader to [8].

Multimodal Dataset for Affect Recognition and Implicit Tagging (MAHNOB-HCI) [52]: contains face videos, audio signals, eye gaze data, and peripheral physiological data (EDA, ECG, RESP) of 27 participants watching 20 emotional videos, self-reported in terms of Arousal, Valence, dominance, predictability, and additional emotional keywords. For more information regarding the dataset, we refer the reader to [52].

Eight-Emotion Sentics Data (EESD) [9]: contains physiological data (EMG, BVP, EDA, and RESP) from an actress during deliberate emotional expressions of Neutral, Anger, Hate, Grief, Platonic Love, Romantic Love, Joy, and Reverence. For more information regarding the dataset, we refer the reader to [9].

Table 1 shows a summary of the datasets used in this paper, highlighting their main characteristics. One should notice that the datasets are heavily imbalanced.

Signal Pre-Processing
The raw data recorded from the sensors usually shows a low signal-to-noise ratio; thus, it is generally necessary to pre-process the data, namely by filtering to remove motion artefacts, outliers, and other noise. Since different modalities were acquired, different filtering specifications are required for each sensor modality; following what is typically found in the state-of-the-art [11], modality-specific filtering was applied to each signal. After noise removal, the data was segmented into 40 s sliding windows with 75% overlap. Lastly, the data was normalised per user, by subtracting the mean and dividing by the standard deviation, to remove subjective bias.
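A minimal sketch of the segmentation and normalisation steps is given below, assuming a hypothetical EDA recording and sampling rate; the modality-specific filtering stage is omitted.

```python
import numpy as np

def sliding_windows(signal, fs, win_s=40.0, overlap=0.75):
    """Segment a 1-D signal into fixed-length windows with overlap.

    fs: sampling rate in Hz; win_s: window length in seconds;
    overlap: fraction of overlap between consecutive windows (75% here).
    """
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    return np.array([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])

def normalise_per_user(x):
    """Per-user normalisation: subtract the mean and divide by the std."""
    return (x - x.mean()) / x.std()

# Hypothetical 5-minute EDA recording sampled at 4 Hz (illustrative values).
fs = 4
eda = np.random.default_rng(0).normal(size=5 * 60 * fs)
windows = sliding_windows(normalise_per_user(eda), fs)
print(windows.shape)  # (n_windows, 40 * fs)
```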

Supervised Learning Using Single Modality Classifiers
The ER classification is performed with a classifier tuned for Arousal and another for Valence. Table 2 presents the experimental results for the SL techniques.
As can be seen, for the ITMDER dataset, state-of-the-art results [7] were available for each sensor modality, which we display; overall, our methodology was able to achieve superior results. Additionally, we observe higher accuracy values in the Valence dimension compared to the Arousal dimension. For the WESAD dataset, the F1-score drops to 0.0 in some cases, in contrast with the Accuracy values; this low F1-score derives from the fact that the class labels were largely imbalanced, with some test sets containing no samples of one of the labels. To conclude, overall, all sensor modalities display competitive results, with no individual sensor modality standing out as optimal for ER.
We present the classifiers used per sensor modality and class dimension in Table 3. Additionally, the features obtained using the forward feature selection algorithm are displayed in Tables 4 and 5, for the Arousal and Valence dimensions, respectively. As shown, they explore similar correlated aspects in each modality.
Both the presented classifiers and features were selected via 4-fold CV, to be used for the SL evaluation and for the DF algorithm, which is detailed in the next section. No classifier was generally able to emerge as optimal for ER on either axis. Lastly, concerning the features for each modality, we used 570, 373, 322, and 487 features for the EDA, ECG, BVP, and RESP sensor data, respectively. However, such a high-dimensional feature vector can be highly redundant and contains many all-zero feature columns; therefore, we were able to reduce the feature vector without significant degradation of the classification performance. Figure A1 in Appendix A displays two histograms merging the features used in the SL methodologies across all datasets for the Arousal and Valence axes, respectively. The figure shows that most features selected via the SFFS methodology are specific to each dataset (a value of 1 means that a feature was selected in just one dataset). The features EDA onsets spectrum mean value and BVP signal mean are selected in 2 datasets for the Arousal axis, while the features EDA onsets spectrum mean value (in 4), RESP signal mean (in 2), BVP signal mean (in 2), and ECG NNI (NN intervals) minimum peaks value are repeated for the Valence axis.

Table 2. Experimental results in terms of the classifier's Accuracy (1st row) and F1-score (2nd row) in %. All listed values are obtained using Leave-One-Subject-Out (LOSO). Nomenclature: SOA-State-of-the-art results; EDA H, EDA F-EDA obtained from a device placed on the hand and finger, respectively. The SOA column contains the results found in the literature [7]. The best results are shown in bold.

Decision Fusion vs. Feature Fusion
In the current sub-section, we present the experimental results for the DF and FF methodologies. Table 6 shows the experimental results in terms of Accuracy and F1-score for the Arousal and Valence dimensions in the 5 studied datasets, along with state-of-the-art results. As can be seen, once again both of our techniques outperform the results obtained for ITMDER [7], most markedly in the Valence dimension. The same holds for the DEAP dataset [8], where we fall short only in Accuracy on the Valence axis, while still attaining competitive results and surpassing the state-of-the-art in terms of F1-score.
On the other hand, with the MAHNOB-HCI dataset [53], our proposal does not attain the literature results. For the EESD and WESAD datasets, no state-of-the-art results are presented since, to the best of our knowledge, ER has yet to be applied to them in the Arousal/Valence space; we thus evaluate a previously unexplored annotation dimension in the present paper. Secondly, when comparing DF with FF, the former surpasses the latter for the EESD dataset on both the Arousal and Valence scales. For the remaining datasets, very competitive results are reached with both techniques. Regarding computational time, FF is more competitive than DF, with an average execution time two orders of magnitude lower than that of DF (Language: Python 3.7.4; Memory: 16 GB 2133 MHz LPDDR3; Processor: 2.9 GHz Intel Core i7, quad-core). Table 7 presents the classifiers used per dataset and sensor modality for the Arousal and Valence dimensions in the FF methodology.
The experimental results show that the selection was: 2 QDA, 1 SVM, 1 GNB, and 1 DT for the Arousal scale; and 2 RF, 1 SVM, 1 GNB, and 1 QDA for the Valence scale. These results show once again that, as with the SL techniques, no particular type of classifier was globally selected across all datasets. Additionally, Table 8 displays the features used per dataset and sensor modality for the Arousal and Valence dimensions in the FF methodology.
Results also showed that, similarly to the SL methodology, most features are specific to a given dataset, with no feature being selected by the SFFS in common across all datasets in the feature selection step.
In summary, this paper explored the datasets in emotion dimensions and evaluation metrics yet to be reported in the literature, and attained similar or competitive results compared to the available state-of-the-art. The experimental results showed that FF and DF using SL attain very similar results, with the best-performing methodology being highly dependent on the dataset; this is possibly due to the selected features differing for each dataset and sensor modality. In the SL classifier results, the best-performing sensor modality remains uncertain, while the DF methodology displayed the higher computational and time complexity. Therefore, considering these points, we select the FF methodology as the best modality fusion option since, with a single classifier and pre-selected features, high performance is reached with low processing time and computational complexity.

Table 6. Experimental results for the FF and DF methodologies in terms of Accuracy (A), F1-score (F1), and time (T) in seconds, per dataset for the Arousal and Valence dimensions. Results obtained using LOSO. The SOA column contains the results found in the literature (ITMDER [7], DEAP [8], MAHNOB-HCI [53]). The best results are shown in bold.

Conclusions and Future Work
Over the past decade, the field of affective computing has grown, with many datasets being created [6-9,52]; however, consolidation is lacking concerning: (1) The ranges of the expected classification performance; (2) The definition of the best sensor modality, SL classifier and features per modality for ER; (3) The best technique to deal with multimodality and its limitations (FF or DF); (4) The selection of the classification model. Therefore, in this work, we studied the recognition of low/high emotional response in two dimensions, Arousal and Valence, for five publicly available datasets commonly found in the literature. For this, we focused on physiological data sources easily measured with pervasive wearable technology, namely ECG, EDA, RESP and BVP data. Then, to deal with the multimodality, we analysed two techniques: FF and DF.
We extend the state-of-the-art by: (1) Benchmarking the ER classification performance for SL, FF and DF in a systematic way; (2) Summarising the accuracy and F1-score (the latter important due to the imbalanced nature of the datasets); (3) A comprehensive study of SL classifiers and an extended feature set for each modality; (4) A systematic analysis of multimodal classification in DF and FF approaches. We were able to obtain superior or comparable results to those found in the literature for the selected datasets. Experimental results showed that FF is the most competitive technique.
For future work, we identified the following research lines: (1) Acquisition of additional data for the development of a subject-dependent model, since emotions are highly subject-dependent, which, according to the literature [11], results in higher classification performance; (2) Grouping users by clusters of response, which might provide insight into sub-groups of personalities, a further parameter to be taken into consideration when characterising emotion; (3) As stated in Section 4.3, we used the SFFS methodology to select the best feature set for all our tested techniques; however, it is not optimal, so the classification results using additional feature selection techniques should be tested; (4) Lastly, our work is highly conditioned on the extracted features, while lately greater focus has been placed on Deep Learning techniques, in which the feature extraction step is embedded in the neural network; ongoing work concerns the exploration and comparison of feature engineering and data representation learning approaches, with emphasis on performance and explainability aspects.

Funding: This work has been partially funded by the Xinhua Net Future Media Convergence Institute under project S-0003-LX-18, by the Ministry of Economy and Competitiveness of the Spanish Government co-funded by the ERDF (PhysComp project) under Grant TIN2017-85409-P, and by FCT/MCTES through national funds and, when applicable, co-funded EU funds under the project UIDB/EEA/50008/2020.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure A1. Histogram combining the features used in the SL (Supervised Learning) methodologies in all the datasets for the Arousal and Valence axes in (a,b), respectively. For information regarding the features, we refer the reader to https://github.com/PIA-Group/BioSPPy.