Emotion Recognition from ECG Signals Using Wavelet Scattering and Machine Learning

Abstract
Affect detection combined with a system that dynamically responds to a person's emotional state allows an improved user experience with computers, systems, and environments and has a wide range of applications, including entertainment and health care. Previous studies on this topic have used a variety of machine learning algorithms and inputs such as auditory, visual, or physiological signals. Recently, much interest has focused on the last, as speech or video recording is impractical for some applications. Therefore, there is a need to create Human–Computer Interface Systems capable of recognizing emotional states from noninvasive and nonintrusive physiological signals. Typically, the recognition task is carried out from electroencephalogram (EEG) signals, obtaining good accuracy. However, EEGs are difficult to register without interfering with daily activities, and recent studies have shown that it is possible to use electrocardiogram (ECG) signals for this purpose. This work improves the performance of emotion recognition from ECG signals using the wavelet transform for signal analysis. Features of the ECG signal are extracted from the AMIGOS database using a wavelet scattering algorithm that obtains features of the signal at different time scales, which are then used as inputs for different classifiers to evaluate their performance. The results show that the proposed algorithm for extracting features and classifying the signals obtains an accuracy of 88.8% in the valence dimension, 90.2% in arousal, and 95.3% in a two-dimensional classification, which is better than the performance reported in previous studies. This algorithm is expected to be useful for classifying emotions using wearable devices.


Introduction
Affective computing (AC) aims for the integration of human emotional states in Human–Computer Interfaces, and there is a pressing need to develop and improve algorithms that allow machines to recognize human emotional states for different purposes. Emotional states are subjective experiences that are commonly classified along two or more dimensions. Russell's circumplex model proposes that all emotions arise from two fundamental neurophysiological systems, one related to valence and the other to arousal; valence refers to the level of pleasantness or unpleasantness of an emotion, and arousal refers to its level of activation or deactivation [1] (see Figure 1). Some researchers have tried to automatically correlate the emotional state dimensions with an input signal such as speech, face images, or physiological signals [2][3][4]. In this way, some databases have been made publicly available to establish a standard framework for comparing different methods and algorithms [5,6]. This work presents a novel emotion recognition algorithm based on wavelet scattering feature extraction and supervised machine learning and shows its performance using AMIGOS: a dataset for mood, personality, and affect research on Individuals and GrOupS [7].

Background
Automatic emotion recognition has been studied for many years, and the most common approaches use face and body images, video recordings, and audio recordings (speech) as inputs. Most recent studies employ machine learning algorithms to perform the recognition task on an annotated dataset. Kahou et al. [8] used a multimodal deep learning approach based on convolutional neural networks (CNN), deep belief networks (DBN), and support vector machines (SVM) on audio and video datasets (AFEW) [9] used in the EmotiW Challenge [10]. Ranganathan et al. [11] fed a convolutional DBN (CDBN) with extracted regions of interest (ROI) from their published emoFBVP dataset. Fan et al. [12] combined long short-term memory (LSTM) and 3D convolutional networks (C3D). Hu et al. [13] proposed a novel method based on cascading a local enhanced motion history image (LEMHI) and a CNN-LSTM, comparing results over three different datasets (AFEW, the extended Cohn-Kanade dataset (CK+) [14], and MMI [15]).
Other types of signals, such as physiological signals, have also been explored as inputs for classification. Zheng et al. [16] recorded and extracted features from EEG and eye-tracking data to feed a support vector machine (SVM) classifier. Alhagry et al. [17] used an LSTM Recurrent Neural Network to classify emotions from the EEG signals of the DEAP dataset [5], obtaining up to 85% accuracy. Balan et al. [18] proposed a comparative analysis between different machine learning and deep learning techniques, also using the DEAP dataset. Paszkiel [19] used blind signal separation for EEG signal reconstruction, allowing the identification of the source generating a given potential. However, recording audio, video, or EEG for ambulatory or daily life emotion recognition is unfeasible in most cases; thus, a less intrusive physiological signal measurement is needed. In this way, some researchers have used less intrusive measurements such as electrocardiogram (ECG) and galvanic skin response (GSR) to classify the signals to an elicited emotion using a machine learning or deep learning classifier [4,20].
Some studies use ad hoc obtained datasets, usually with a small number of signals and volunteers. To overcome this situation, some researchers have made large, publicly available datasets with more volunteers, a mix of different signal measurements, and emotion elicitation approaches, allowing the validation and comparison of different classification algorithms across studies. Metrics such as accuracy and F1-score are used in the comparison [21]. Some examples are DREAMER [22], HCI-Tagging [6], and AMIGOS [7]. Some recent publications use at least one of these datasets. For example, a probabilistic Bayesian deep learning algorithm has been employed to classify valence in the DREAMER (accuracy, 0.90; F1-score, 0.88) and AMIGOS (accuracy, 0.86; F1-score, 0.83) datasets [23]. HCI-Tagging data is employed in [24] to extract intrinsic mode function features from the ECG and feed a K-Nearest Neighbors (KNN) classifier, obtaining an accuracy of 0.558 for arousal and 0.597 for valence classification. In [25], the authors use deep learning to classify ECG signals from the AMIGOS dataset, obtaining F1-scores of 0.76 and 0.68 for arousal and valence, respectively. In the original publication of AMIGOS [7], manually extracted time and frequency features are used, obtaining mean F1-scores of 0.545 and 0.551, respectively. In [26], mean F1-scores of 0.851 and 0.837 were obtained with a CNN self-supervised approach. Tung et al. [27] employed entropy domain features and an XGBoost model for classification, obtaining mean F1-scores between 0.56 and 0.63.
Regarding model validation, one of the most used methods in this field is leave-k-subjects-out (LkSO) cross-validation. ECG may show subject specificity [28], so LkSO is employed to assess how the algorithms generalize to a new subject and to prevent overfitting. Among LkSO variants, leave-one-subject-out cross-validation (LOSO) is the most used in previous works [23,27]. Another validation scheme is 10-fold cross-validation, which is employed in [26,29,30]. Moreover, a comparison of validation schemes is performed in [24], which finds, consistently with other studies, that arousal classification accuracy is higher than that of valence.

Dataset
A publicly available database, AMIGOS, was used in this work [7]. This database was preferred over other available databases due to its relatively high number of participants, recent publication, large number of signals and experiments, standardized data collection methods, and detailed and high-resolution annotation. This database includes EEG, ECG, galvanic skin response (GSR), and face video data from 40 participants recorded during two experiments. The first experiment (short videos) consisted of individually watching 16 short videos of less than 250 seconds each. The second experiment (long videos) consisted of watching 4 long videos of around 14 min each in groups (see Figure 2). Experiments were done in a lab-controlled environment to record changes in physiological signals when exposed to the video stimuli. The face video recording was subsequently externally score-annotated by experts in arousal and valence dimensions with a time resolution of 20 s (length of each annotated time window) for both short and long experiments. In addition, in the case of short experiments, a self-annotation was carried out by means of a questionnaire after each short-video presentation.

Algorithm Overview
The algorithm described in this work is represented by the block diagram in Figure 3. Data are first preprocessed: missing data are filtered out, then the signal is band-pass filtered and segmented. Features are extracted using two different approaches, classical time and frequency features and wavelet scattering features; the features are then fed into different classifiers, which are evaluated with 10-fold cross-validation for their accuracy and F1-score performance.

Data Processing
In this work, we used the recordings of the short and the long experiments together. The database was first preprocessed in search of missing data, and subjects with more than 30% missing ECG measurements or without annotations were omitted. The ECG signal was then filtered with a band-pass Butterworth filter between 0.5 and 30 Hz. The self- and external annotation scores for the valence dimension range from −1 to 1 and were converted to categorical values (positive or negative) using 0 as the threshold. Similarly, the annotation scores for the arousal dimension range from 0 to 9 and were converted to categorical values (high or low) with a threshold of 5. After the conversion, Fisher's exact test [31] was performed between self- and external labels for the short experiments, as there is no self-annotation for the long experiment. The test rejected the null hypothesis at the 5% significance level, so there is an association between the external and self-annotations. Based on this result, the higher time resolution of the external annotations, and the availability of labels for both experiments, the classification was performed using the external annotations as targets.
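As an illustration, the filtering and label conversion steps can be sketched in Python with NumPy and SciPy (a minimal sketch, not the original MATLAB implementation; the 256 Hz sampling rate is an assumption and should be taken from the AMIGOS documentation):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # sampling rate in Hz (assumed; check the AMIGOS documentation)

def preprocess_ecg(ecg, fs=FS, low=0.5, high=30.0, order=4):
    """Band-pass filter the raw ECG between 0.5 and 30 Hz (zero-phase)."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, ecg)

def binarize_labels(valence, arousal):
    """Convert continuous annotation scores to categorical labels.

    Valence scores range from -1 to 1 (threshold 0);
    arousal scores range from 0 to 9 (threshold 5).
    """
    return valence > 0, arousal > 5
```

The zero-phase `filtfilt` avoids distorting the timing of the ECG waves, which matters for the interbeat-interval features extracted later.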
Since the external annotations were made at 20-s time windows, we divided each ECG signal into segments of that length and assigned them to their respective label. Each class (positive/negative valence and high/low arousal) contains different amounts of data. For example, of the 8610 segments of the long experiment, 29% present positive valence and 71% negative valence, and 20% present high arousal and 80% low arousal. Given this imbalance, classifying all the segments as negative valence and low arousal would give 71% and 80% accuracy in valence and arousal, respectively [32]. There are different methods to avoid the undesirable effects of an imbalanced dataset, namely, data reduction, data augmentation, undersampling, oversampling, resampling, and cross-validation [33]. In this work, the overrepresented classes were randomly undersampled: the size of the smallest class is preserved, and random samples are discarded from the larger classes until all classes contain the same number of segments.
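A minimal sketch of this random undersampling step (a hypothetical helper, not the original code) could look like:

```python
import numpy as np

def undersample(features, labels, seed=0):
    """Balance classes by randomly discarding samples from larger classes
    until every class matches the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid blocks of identical labels
    return features[keep], labels[keep]
```

After this step a trivial majority-class predictor scores only 1/k accuracy for k classes, so reported accuracies reflect genuine discrimination.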

Feature Extraction
To classify the segments in the dimensions of arousal and valence with classical machine learning algorithms, characteristics of the measured signal must be extracted. In this work, we considered classical time-and frequency-based features and compared them with wavelet scattering features, as detailed below.

Time Domain Features
Commonly extracted features in emotion recognition include time domain features [7]. In this study, 17 features were extracted: the root mean square of successive differences (RMSSD) of interbeat intervals (IBI); the proportion of successive IBI differences longer than 20 ms and 50 ms (pNN20, pNN50); 2 Poincaré coefficients [34]; and 6 statistical parameters each for heart rate (HR) and heart rate variability (HRV), namely, mean, standard deviation, skewness, kurtosis of the raw signal over time, and the number of times the signal value is above/below the mean ± 1 standard deviation.
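A few of these time domain features can be sketched as follows (a simplified illustration in Python; R-peak detection to obtain the IBI series is not shown, and the Poincaré descriptors SD1/SD2 are computed here from successive-difference statistics, which is one common formulation and not necessarily the one used in the paper):

```python
import numpy as np

def hrv_time_features(ibi_ms):
    """Compute a subset of the time-domain HRV features from a series of
    interbeat intervals given in milliseconds."""
    diffs = np.diff(ibi_ms)
    rmssd = np.sqrt(np.mean(diffs ** 2))          # RMS of successive differences
    pnn20 = np.mean(np.abs(diffs) > 20)           # fraction of diffs > 20 ms
    pnn50 = np.mean(np.abs(diffs) > 50)           # fraction of diffs > 50 ms
    # Poincare descriptors derived from successive-difference statistics
    sd1 = np.sqrt(0.5) * np.std(diffs)
    sd2 = np.sqrt(max(2 * np.var(ibi_ms) - 0.5 * np.var(diffs), 0.0))
    return {"rmssd": rmssd, "pnn20": pnn20, "pnn50": pnn50,
            "sd1": sd1, "sd2": sd2}
```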

Frequency Domain Features
Other commonly extracted features are related to the transformation of the signal into the frequency domain and the power of the signal associated with different frequency bands: very low frequency (VLF), low frequency (LF), mid frequency (MF), and high frequency (HF). Several studies indicate that there is some correlation between the LF/HF power ratio and the sympathetic and parasympathetic activity of the nervous system [35,36]; therefore, the following frequency domain features were extracted: VLF, LF, MF, and HF power; the LF/HF ratio; power spectral entropy; sample entropy; and Shannon entropy [27].

Wavelet Scattering Features
The extraction of features in the time and frequency domains might not be enough to capture the detail and variability present in the ECG signal during emotional state changes; therefore, we employed mathematical tools based on the wavelet transform to extract more complex features [37]. The wavelet scattering algorithm subdivides each signal into a fixed number of scattering windows and extracts features from them using wavelet transformations. Each scattered window is then classified independently, and the classification of the original segment is performed based on a uniform weighting (voting) of the classifications of its scattered windows. The wavelet scattering algorithm consists of a three-stage iterative transformation of the signal: wavelet convolution, modulus, and low-pass filtering. This architecture is similar to a Convolutional Neural Network (CNN), with the difference that the convolution filters are not learned but are predefined wavelet functions. The scattering coefficients are intended to have low variance within a class and high variance across classes. Moreover, they are insensitive to input translations up to an invariance scale and have desirable properties such as multiscale contraction, linearization of hierarchical symmetries, and sparse data representations [38][39][40][41]. The algorithm used in this work is the wavelet scattering function implemented in the MATLAB Signal Processing Toolbox. The function was set with an invariance scale of 20 s and the default parameters for filters and wavelets, resulting in two filter banks with 8 and 1 wavelets per octave, respectively.
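The actual implementation relies on MATLAB's wavelet scattering function, which also computes second-order scattering paths. A much-simplified NumPy sketch of only the first-order pipeline (band-pass convolution, modulus, low-pass averaging) conveys the idea; the Gaussian/Morlet-like filters and parameters here are illustrative assumptions, not the Toolbox defaults:

```python
import numpy as np

def gauss_lowpass(n, fs, t_invariance):
    """Frequency response of a Gaussian low-pass with ~t_invariance support."""
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    sigma_f = 1.0 / (2 * np.pi * t_invariance)
    return np.exp(-0.5 * (freqs / sigma_f) ** 2)

def morlet_bank(n, fs, center_freqs, q=8):
    """Analytic Morlet-like band-pass filters, one per center frequency."""
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    bank = []
    for fc in center_freqs:
        sigma = fc / q                       # bandwidth set by quality factor q
        h = np.exp(-0.5 * ((freqs - fc) / sigma) ** 2)
        h[freqs < 0] = 0.0                   # analytic: positive frequencies only
        bank.append(h)
    return bank

def scattering_first_order(x, fs, center_freqs, t_invariance):
    """First-order scattering: band-pass filter, modulus, low-pass average."""
    n = len(x)
    X = np.fft.fft(x)
    phi = gauss_lowpass(n, fs, t_invariance)
    coeffs = []
    for h in morlet_bank(n, fs, center_freqs):
        u = np.abs(np.fft.ifft(X * h))                    # wavelet modulus
        s = np.real(np.fft.ifft(np.fft.fft(u) * phi))     # local averaging
        coeffs.append(s.mean())
    return np.array(coeffs)
```

The modulus nonlinearity demodulates each band, and the final low-pass averaging is what makes the coefficients insensitive to translations up to the invariance scale.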

Dimensionality Reduction
The wavelet scattering method generates a large number of features, so it can be convenient to reduce their number: smaller feature sets are easier to explore and make the analysis faster for machine learning algorithms. In this work, Principal Component Analysis (PCA) was used to reduce the dimensionality by creating linear combinations of the original features and selecting subsets that capture as much of the information as possible [42]. We used the PCA option of the MATLAB Classification Learner App, which allows selecting the desired number of components. The performance of several classifiers with different numbers of PCA-obtained features was later assessed.
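PCA itself reduces to a singular value decomposition of the centered feature matrix; a minimal NumPy sketch (equivalent in spirit, though not in implementation, to the Classification Learner option) is:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature vectors onto their top principal components.

    features: (n_samples, n_features) matrix.
    Returns an (n_samples, n_components) matrix of component scores."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

For example, reducing the 210 scattering coefficients to 50 components would call `pca_reduce(scattering_features, 50)`.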

Classification Methods
To build and evaluate the performance of different classification models, the MATLAB Classification Learner App was used, allowing quick training, testing, and validation of several classifiers. The methods used for screening were Linear Discriminant Analysis (LDA) [43], decision trees (DT) with a maximum of 20 splits [44], Kernel Naive Bayes [45], KNN with K = 10 [46], linear Support Vector Machines (SVM) [47], and Ensemble Bagged Tree classifiers [48], all with default parameters. The previously obtained features were used as inputs and the labels of each segment or scattered window as targets. Later, the classifiers with the best performance in terms of accuracy and F1-score were tested with PCA-obtained features.

Validation Methods
Given the large amount of data, k-fold cross-validation [49] with k = 10 was used. In k-fold cross-validation, the dataset is first randomly divided into k disjoint folds with approximately the same number of samples, and each fold in turn serves as the test set for the model trained on the other k − 1 folds. However, if we want to build a model that generalizes to new subjects, k-fold cross-validation can be overly optimistic, since records from the same subject are present in both the training and test sets [50]. Leave-one-subject-out cross-validation (LOSO CV) prevents this by leaving out all the data from a given subject for testing while using the rest of the data for training the model, repeating the process for every subject [51]. Therefore, LOSO CV was also performed, as reported in other works [7].
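The LOSO split generation can be sketched as follows (a generic illustration, not the original MATLAB code):

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (train_idx, test_idx) pairs, leaving one subject out per fold.

    subject_ids: per-sample subject identifiers, one entry per segment."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == subject)   # all data of one subject
        train = np.flatnonzero(subject_ids != subject)  # everything else
        yield train, test
```

Because every segment of a subject lands entirely in either the training or the test side, the reported LOSO scores estimate performance on an unseen person rather than on unseen segments of known people.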

Data Processing
Participants with missing data and incomplete measurements were removed. Participant 9 was removed for having less than 70% of the total data and three more subjects (participants 8, 24, 28) for having missing annotations or no data from the long experiment.
After dividing each signal into 20-s segments, 94 time windows were obtained for each of the 36 subjects in the short experiment, leading to 3384 segments. Moreover, 246 time windows were obtained for each subject in the long experiment, leading to 8610 segments. Therefore, considering both short and long videos, a total of 11,994 segments were obtained. After deleting some segments with missing data and balancing the groups for each experiment, 5860 valence segments (50% positive and 50% negative) and 4070 arousal segments (50% high and 50% low) were obtained for the short and long experiments together. When using both dimensions (four classes), the dataset was left with 1028 segments, equally distributed among the classes (257 segments each).

Extracted Features
From each ECG segment, 24 time and frequency domain features were extracted. Additionally, the wavelet scattering algorithm, set with an invariance scale of 20 s, subdivides each signal into 5 scattering windows (so there are 5 times more scattered windows than original segments) and extracts a vector of 210 scattering coefficients for each of the 5 scattered windows of each signal. The coefficients from each scattered window are used as separate inputs, assuming that each scattered window inherits the label of its original signal. After classification, the outputs for the 5 scattered windows are combined in an equal-weighted vote to classify the original window.
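The equal-weighted voting over the 5 scattered windows of a segment can be sketched as below (a hypothetical helper; a tie corresponds to the "No Unique" outcome reported in the two-dimensional results):

```python
from collections import Counter

def vote(window_predictions):
    """Combine the predicted classes of the scattered windows of one segment.

    Returns the majority class, or None when the vote is tied
    (the "No Unique" case)."""
    counts = Counter(window_predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no unique majority class
    return counts[0][0]
```

With an odd number of windows and only two classes a tie cannot occur, which is why the one-dimensional classification never produces a "No Unique" outcome.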

Classification in One Dimension
The results below summarize the performance of classifiers with the two types of features, both for one dimension (valence or arousal) with two classes (positive and negative for valence, high and low for arousal) and for two dimensions (valence and arousal) with four classes I-IV (see Figure 1).
A first screening of classifiers was performed for each dimension using default parameters, considering time and frequency features and wavelet scattering features (with scattered windows) separately. Table 1 summarizes the overall performance (accuracy and F1-score) of the different classifiers. The results show that the mean performance of the classifiers using wavelet scattering features is higher than that using time and frequency features for both valence and arousal. The classifiers that perform best with wavelet scattering are Ensemble (accuracy: 89.3% valence and 89.1% arousal) and KNN (82.7% valence and 81.2% arousal), followed by SVM, Discriminant Analysis, Decision Tree, and Naïve Bayes. After screening, the best-performing classifiers (accuracy and F1-score over 0.80) with default parameters and wavelet scattering features were further analyzed. Figure 4a,b show the confusion matrices for the 10-NN classifier for the valence and arousal dimensions, respectively. The overall accuracy is 82.7% in the valence dimension and 81.2% in the arousal dimension.

Classification in Two Dimensions
A similar procedure was applied to classify the signals in two dimensions (four classes) using data from the short and long experiments. A total of 24 time and frequency features and 210 wavelet scattering features were considered. Table 2 presents the classifier screening results, showing that the best classifiers are Ensemble and KNN. KNN has significantly lower accuracy (73.6%) than in the two-class case (average 82.0%), while the Ensemble Bagged Tree classifier maintains a high accuracy (88.9%). Additionally, PCA was used to obtain 50, 100, 150, and 200 linear combinations of the 210 original features, which were then fed to the classifiers. The results of the Ensemble classifier using PCA and the wavelet scattering features for four classes using the short and long experiments are presented in Table 3. Accuracy decreased only from 89.0% to 84.4% when reducing the number of components from 210 to 50. Table 4 presents the detailed per-class precision, recall, and F1-score results for the Ensemble classifier, using 210 features (PCA disabled) and classifying four classes. As the scattering algorithm produces several scattering windows for each original segment, the latter is classified based on a majority vote of the classes assigned to its scattered windows. This can result in higher accuracy, but it is also possible that there is no unique majority class, which is reflected in the "No Unique" row of the predicted classes. Figure 6 shows the confusion matrix and accuracy (95.3%) of the original segment classification based on the predicted classes of each scattered window using the Ensemble classifier.
Figure 6. Confusion matrix using two dimensions, valence and arousal. No Unique: the classifier could not decide due to a tie in the votes of the scattered windows of one segment. The results were obtained using short and long experiment data, 210 wavelet features, the Ensemble Bagged Tree classifier, and a majority vote over the scattered windows of each segment.

Short Experiment Result Comparison
In order to make the results comparable with other studies, the methods proposed in this work were also applied to the short experiments of the AMIGOS dataset only. The Ensemble classifier achieved an accuracy of 0.902 (arousal) and 0.904 (valence), and the KNN classifier achieved an accuracy of 0.888 (arousal) and 0.889 (valence), using 10-fold cross-validation and the wavelet-scattering-extracted features. Table 5 summarizes the comparison of these results with other published studies using the same dataset and different classification algorithms. In addition to k-fold validation, subject generalization of the model was assessed with LOSO CV. The Ensemble classifier achieved an accuracy of 0.819 (arousal) and 0.837 (valence), outperforming the KNN classifier, which reached an accuracy of 0.623 (arousal) and 0.586 (valence). These results indicate how well the model would perform with data from a new subject.

Algorithm Performance for Valence and Arousal
Classifier screening shows that classifiers using wavelet-extracted features outperform those employing features extracted with traditional time and frequency metrics, for both valence and arousal classification. The increased performance can be explained by the wavelet scattering method's ability to decompose and extract features at different scales of the signal, providing both time and frequency resolution and allowing a classifier to capture the differences between classes [52]. Regarding one-dimensional classification, we found slightly better performance for arousal than for valence when considering the same type of features and classifier, similar to other works [6,53]. For both arousal and valence classification, the Ensemble and 10-NN classifiers presented better accuracy than the other classifiers. The good performance of the Ensemble classifier might be due to its characteristics: the selected Ensemble Bagged Tree classifier is a bootstrap aggregation (bagging) algorithm that creates a collection of decision trees, which are combined to reduce the variance of the classification result. On the other hand, KNN clusters data in the feature space, where samples of similar classes may lie close together. The Ensemble classifier has previously been reported to have high accuracy for ECG arrhythmia classification [54].

Classification Performance in Two Dimensions
When using two dimensions simultaneously (four classes), the performance of the KNN classifier dropped, possibly due to the similarity or small distance between the features of each class. On the other hand, the Ensemble Tree classifier's accuracy was similar to that achieved in one-dimensional classification, possibly because the classifier is deep enough to capture the differences between classes from the scattering features of each signal.
PCA can be used to reduce the dimensionality of the input vector, but it adds a step to the algorithm and tends to degrade classifier performance, since reducing the number of variables of a dataset naturally comes at the expense of accuracy. In this work, accuracy decreased only from 89.0% with all 210 features to 84.4% with 50 principal components, a reasonable trade of accuracy for simplicity.

Scattering Window Classification Performance
The classification using scattering features showed high accuracy in every scenario. This result can be explained by wavelet scattering's ability to represent time and frequency domain features at different scales with each scattered window [52], which can then be fed to a classifier able to learn from large feature vectors and generate the output classification. Moreover, the classification of the original signal segment, based on the majority vote of the classes assigned to its scattered windows, may inherit the high accuracy of the scattered window classification. This method yields better classifier performance, with the drawback that in some cases a tie occurs and the algorithm cannot determine the final class. However, this can be avoided by selecting an odd number of scattered windows and using only two classes.

Short Experiment Result Comparison
Regarding the short experiment results, the algorithm also achieves high performance, although lower than with the full dataset. This was somewhat expected, since the short experiment data represent less than 30% of the entire dataset, which may not be enough to achieve high accuracy with the applied classifiers. However, the achieved performance is higher than that reported by other studies.

LOSO Validation
Finally, subject-independent classification using leave-one-subject-out (LOSO) validation yielded lower performance than 10-fold cross-validation, dropping to about 58% for KNN and 82% for Ensemble. This result shows that the Ensemble classifier is more robust and may generalize better to new subjects and samples.

Conclusions and Future Work
The extraction of features of the ECG signal by means of wavelet scattering has been shown to improve the performance of classic machine learning algorithms in classifying emotions in the arousal and valence dimensions, compared to features in the time and frequency domains. The wavelet scattering algorithm allows the ECG signal to be analyzed at different time scales, simultaneously in the time and frequency domains. This increases the separability and differentiability of signals and patterns from two different classes. The method was validated using the AMIGOS database to classify emotions in two dimensions, arousal and valence, and this study demonstrates a higher overall performance than previous works. The classifier comparison shows that the Ensemble Bagged Tree classifier can be a good choice for emotion classification from ECG signals.
Several systems have been proposed to visualize mind states of a human subject based on EEG signals [55]. As future work, we intend to build a system capable of discriminating emotions in real time using ECG signals. Currently, we are in the process of building a wearable device consisting of first-layer clothing with embedded electrodes for ECG signal monitoring (https://www.auradt.com/productos/primera-capa, accessed on 5 May 2021). For increased comfort and long-term applications, we eliminated the need for conductive gels and skin preparation by integrating dry electrodes in a first-layer t-shirt developed by the local textile manufacturer ACTAS-Grupo CLER (see Figure 7). The prototype is capable of measuring, storing, and streaming data, and we plan to incorporate real-time signal processing and implement the emotion recognition algorithm in the near future. The major parts and components of the system, with a ground station to process and classify the ECG, are represented in the high-level block diagram shown in Figure 8. Wavelet analysis has been successfully applied to other biopotential signals such as EMG [56]; thus, the proposed methodology is also a promising alternative for EMG-based pattern recognition problems [57].