1. Introduction
As the World Health Organization’s Director-General, Tedros Adhanom Ghebreyesus, remarked in 2020, mental health is essential for overall health and well-being [1]. The outbreak of the COVID-19 pandemic brought new challenges to mental health. According to the Kaiser Family Foundation’s investigation into the effect of COVID-19 on American life, respondents were concerned about losing income through job loss, workplace closure, or reduced working hours during the pandemic [2]. Six out of ten adults were concerned about becoming infected or exposing their family to the virus while working. All of these concerns have negative effects on mental health and emotions. Additionally, in a survey conducted by Changwon Son’s team [3], 71% of students in the United States reported that their anxiety and stress levels increased as a result of the pandemic. A report from the University of Saskatchewan, Canada [4], focusing on the university’s medical students, showed a similar result. These findings demonstrate the seriousness of the COVID-19 pandemic’s impact on mental health. Therefore, in this challenging era of COVID-19, research on intelligent systems that monitor for the build-up of negative emotions in a person is becoming more pressing.
An emotion recognition system (ERS) recognises human emotions and can be used in many different fields. For example, a stress detector that assesses employees’ stress levels using electrocardiogram (ECG) and galvanic skin response (GSR) signals is proposed in [5]. Cardiac-based ERSs are proposed in [6,7,8] to assess driver stress levels, and driver drowsiness detection is addressed in [9]. ERSs have also been proposed for various uses in education. In [10], voice-based emotion identification for affective e-learning is proposed. A facial ERS that enables teachers to monitor students’ moods throughout class [11] and the adoption of a physiological signal-based ERS in an intelligent tutoring system (ITS) [12] are further examples of ERS usage in education.
From the works discussed above, it can be observed that an ERS can be built using multiple modalities: ECG, GSR, voice, and facial images. Notably, physiological signals are commonly used. Among the physiological signals often utilised as ERS modalities are the electroencephalogram (EEG) [13,14] and ECG [15,16,17,18]. Some works integrate several modalities in their ERS [19,20,21], while others use a single modality [17,22,23]. Due to the high demand, the number of works on physiological-based ERSs utilising wearable devices and noninvasive sensors has also increased. Physiological-based ERSs avoid the problem of social masking [24] and are less prone to faked or manipulated emotions [25]. The utilisation of wearable devices is supported by their popularity among consumers. Rock Health surveyed digital health adoption and found that wearable device usage increased significantly, from 24% in 2018 to 33% in 2019 [26]. Additionally, Statista, a German-based online statistics portal, predicted that the number of smartwatch users would reach 1.2 million by 2024 [27]. These statistics suggest that building ERSs with wearable devices is a promising direction with considerable room for advancement.
Many labelled emotion databases comprising various modalities have been produced in recent years [28], for example, a database for emotion analysis using physiological signals (DEAP) [13]; a database for affect, personality, and mood research on individuals and groups (AMIGOS) [18]; a database for decoding affective physiological responses (DECAF) [20]; and a multimodal physiological emotion database for discrete emotion recognition (MPED) [16]. Several databases are composed of data collected using nonportable devices and expensive technology; meanwhile, the databases for emotion recognition through EEG and ECG (DREAMER) [14], wearable stress and affect detection (WESAD) [29], and emotion recognition smartwatches (ERS) [30] are compilations of signals collected from wireless, low-cost, and off-the-shelf devices. These databases have been utilised in studies by researchers with different levels of success [13,18,20].
In past research, the issues of racial inequity and bias in wearable technology, particularly for users with darker skin tones, have been raised [31,32]. Wearable devices that track health activity or monitor heart conditions are less accurate for users with darker skin tones, tattoos, or arm hair. Noseworthy et al. [32] recommended that researchers be aware of racial bias and report study results across demographic subgroups to minimise it. To the best of our knowledge, no existing physiological affective dataset collected from wearable devices addresses this issue and includes multiple Asian ethnicities. For example, the DEAP dataset comprises data from European participants [13], while the MPED dataset consists of data from Chinese participants only.
Thus, this paper introduces the Asian Affective and Emotional State (A2ES) database, consisting of ECG and photoplethysmogram (PPG) recordings of 47 participants from various Asian ethnicities. Both ECG and PPG recordings have been reported to be affected by skin colour [31,33]. An ECG records the heart’s electrical activity, which originates at the sinoatrial node and drives the contractions that pump blood through the body [34]. As illustrated in Figure 1, the ECG comprises three primary components: the P wave, the QRS complex, and the T wave. PPG, on the other hand, is a low-cost and noninvasive way to measure blood volume changes during heart activity; a PPG sensor has two main components: an incoherent light source and a photoreceiver [35]. A typical PPG signal element is shown in Figure 2, with the systolic period associated with blood in-rush, the diastolic period associated with relaxation, and the dicrotic notch associated with pulse reflection [36]. The participants in the data collection were exposed to 25 audio–visual stimuli to elicit specific emotions. The participants’ self-assessment ratings and the list of the 25 stimuli are also presented here.
The applicability of the A2ES ECG and PPG data for building an ERS was tested using machine learning and deep learning approaches. Five machine learning algorithms, namely, support vector machine (SVM), naive Bayes (NB), K-nearest neighbours (KNN), decision tree (DT), and random forest (RF), were applied. The ECG-based ERS built using SVM and the PPG-based ERS built using RF performed best. The deep learning models performed poorly because the dataset was too small to suit them.
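As a rough illustration of this comparison, the sketch below evaluates the five classifiers with scikit-learn cross-validation. The feature matrix X and emotion labels y are placeholders; the actual A2ES feature set, hyperparameters, and validation protocol may differ.

```python
# Illustrative comparison of the five classifiers named above; the data,
# hyperparameters, and cross-validation scheme are assumptions.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
}

def compare_classifiers(X, y, cv=5):
    """Return mean cross-validated accuracy for each classifier."""
    results = {}
    for name, clf in classifiers.items():
        pipeline = make_pipeline(StandardScaler(), clf)  # scale features first
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results
```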
The rest of this paper is organised as follows. Section 2 describes related works, including ECG- and PPG-based ERSs as well as ECG- and PPG-based databases. Section 3 covers the experiment protocol, including the stimuli selection procedure, participant details, and the data collection setting and protocol. Section 4 describes the data preprocessing and feature extraction process. Section 5 presents an evaluation of the ECG- and PPG-based ERS performances. A concluding discussion and future work directions are provided in Section 6.
2. Related Works
ECG and PPG are popular modalities for ERSs, and many studies using them have achieved promising results in representing human emotions. Bagirathan et al. [22] utilised ECG signals to recognise positive and negative valence states in children with autism spectrum disorder (ASD); the proposed system achieved an accuracy of 81%. Meanwhile, a PPG-based ERS with a convolutional neural network (CNN) is proposed in [38] for fast recognition of valence and arousal. The system achieved 75.3% valence and 76.2% arousal accuracy within 1.1 s for short-term emotion recognition. In 2021, Preethi et al. developed a real-time ERS to automate a music selection system based on emotions recognised from PPG signals [39]. An accuracy of 91.81% was achieved utilising features extracted from phase-space geometry (Poincaré analysis), with maximum accuracy rates of 96.67% for binary classification and 91.11% for multiclass classification. Hasnul et al. evaluated the performance of an ECG-based ERS with features extracted using two distinct feature extraction toolboxes, TEAP and AUBT, and achieved an accuracy of up to 65% [17].
ECG and PPG are also commonly integrated with other physiological signals as a strategy to improve ERS performance. In [40], ECG was used together with temperature (TEMP), galvanic skin response (GSR), electromyography (EMG), respiration (RESP), accelerometer signals, and facial expressions to recognise dimensional emotional states (high arousal and high valence (HAHV), high arousal and low valence (HALV), low arousal and high valence (LAHV), and low arousal and low valence (LALV)), as well as arousal and valence; the accuracy obtained was in the range of 40 to 70%. Zainudin et al. [41] proposed stress detection using ECG and GSR signals categorised with two approaches, machine learning and deep learning, and achieved a best accuracy of 95%. Tian Chen et al. proposed a multimodal fusion ERS that includes EEG and ECG [42]; the fusion ERS outperformed the single-modality ERS, with an accuracy of 85.38% for valence and 77.52% for arousal.
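For readers unfamiliar with the quadrant labels above, the small sketch below maps self-assessment ratings to HAHV/HALV/LAHV/LALV. The 1–9 rating scale and the midpoint threshold of 5 are common conventions assumed here, not values taken from the cited study.

```python
# Hypothetical mapping of valence/arousal ratings to the four quadrants;
# the 1-9 scale and threshold of 5 are assumptions for illustration.
def quadrant(valence: float, arousal: float, threshold: float = 5.0) -> str:
    arousal_tag = "HA" if arousal >= threshold else "LA"
    valence_tag = "HV" if valence >= threshold else "LV"
    return arousal_tag + valence_tag  # e.g., quadrant(7, 3) -> "LAHV"
```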
In [43], another emotion-based music recommendation engine was built using a combination of PPG and GSR signals from wearables. The emotional information from the PPG and GSR was fed to a collaborative and content-based recommendation engine, and the best accuracy obtained exceeded 70%. Domínguez-Jiménez et al. [44] also proposed an ERS using PPG and GSR from wearable devices. Their ERS recognises three emotions, amusement, sadness, and neutral, and achieved a testing accuracy of up to 100%. In [45], a deep physiological affect network, a robust physiological model that recognises human emotions using PPG and EEG signals, is presented. The proposed system achieved 78.72% and 79.03% overall accuracy for recognising valence and arousal, respectively.
Although ECG and PPG signals can be used independently or integrated with other physiological signals, they can also be fused together to improve the robustness and performance of an ERS. For example, Li et al. [46] proposed group-based individual response specificity (IRS) to improve emotion recognition performance by fusing statistical features from ECG and PPG with GSR; the highest performance achieved was 78.06% using an MLP classifier. The authors of [47] also proposed an automatic ERS based on the fusion of ECG and PPG features and achieved a best performance of 85.70%. Additionally, the fusion of ECG and PPG features was used in [48] to classify three emotions, positive, neutral, and negative, with a CNN, achieving an accuracy of 75.40%.
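The sketch below illustrates the general idea of feature-level ECG–PPG fusion referred to in these works: per-segment feature vectors from each modality are concatenated before classification. The feature dimensions, random data, and classifier choice are hypothetical and not taken from the cited papers.

```python
# Minimal sketch of feature-level fusion; feature counts, data, and the
# classifier are placeholders, not the cited authors' configurations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuse_features(ecg_features, ppg_features):
    """Concatenate per-segment ECG and PPG feature matrices column-wise."""
    return np.hstack([ecg_features, ppg_features])

# Example: 100 segments with 12 ECG features and 8 PPG features each
ecg_features = np.random.rand(100, 12)
ppg_features = np.random.rand(100, 8)
labels = np.random.randint(0, 3, size=100)  # positive / neutral / negative

fused = fuse_features(ecg_features, ppg_features)
clf = RandomForestClassifier().fit(fused, labels)
```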
In affective computing, existing datasets collected from either a single modality or multiple modalities using physiological and physical signals are important for the advancement of the field. Existing datasets and their size (number of participants and number of stimuli), type of stimuli, modalities, devices, and labels are tabulated in Table 1. Most of the listed datasets contain ECG signals. The ECGs were collected using various devices, namely, Shimmer, Biosemi Active System, Biopac System, FlexComp, ProComp Infiniti, and Mobi. Only two of the datasets contain PPG without ECG: DEAP [13] and DEAR-MULSEMEDIA [49]. In these works, the PPG signals were recorded using Biosemi ActiveTwo and Shimmer devices. Four datasets contain both cardiac physiological signals (ECG and PPG): CASE [50], CLAS [51], ECSMP [52], and K-EmoCon [19]. CASE and CLAS contain ECG and PPG signals measured using Thought Technology and Shimmer3 devices, respectively; ECSMP and K-EmoCon used the AECG-100 and Polar H7 for ECG and the Empatica E4 for PPG. The ECSMP [52] dataset has the greatest number of subjects, and EMDC [53] has the greatest number of stimuli compared to the other datasets. Twelve datasets used audio–visual stimuli, making it the most common type of stimulus used to elicit emotions. Additionally, most of the datasets used valence and arousal as emotion annotations in addition to basic emotions, such as joy, anger, sadness, fear, disgust, stress, or neutral. A review paper [25] discusses most of these datasets in detail.
6. Discussion and Conclusions
In this era of COVID-19 and many other challenges, developing an emotion-aware system is beneficial for society’s mental health. Therefore, an affective research dataset, A2ES, was proposed in this paper. The dataset consists of ECG and PPG recordings collected from 47 Asian participants of various ethnicities using wearable and off-the-shelf devices. It was built to address the lack of such datasets in affective computing research and to help avoid bias in future research. The participants were exposed to 25 audio–visual stimuli to elicit specific targeted emotions. The participants’ self-assessment ratings and the list of the 25 stimuli used were included, along with evaluations of ECG- and PPG-based ERS performance using ML and DL approaches. The findings demonstrate the usability of the A2ES for emotion recognition. ML classified emotions from the A2ES ECG and PPG data better than DL, because the small sample size of the A2ES dataset limited the amount of training data. The A2ES data are available upon request to other researchers for noncommercial purposes. Although the data are labelled according to the seven basic emotions of neutral, happy, surprise, fear, disgust, sad, and anger, as well as their intensity, they can be relabelled to arousal and valence. The data are not linked to the participants’ identities.

It is suggested that future research adopting the A2ES consider different methods of feature extraction, feature selection, and dimensionality reduction so that only informative features are used for more accurate classification; enhanced classification algorithms and ensemble classifiers; and ways to address the class imbalance in the data. Additionally, to benefit from the strength of DL, a prospective focus should be to enhance the ERS by increasing the size of the data, for example, by applying data augmentation techniques. Incorporating the A2ES dataset alongside other affective computing datasets when building an ERS is expected to lead to a less biased ERS.
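As a pointer for the data augmentation direction mentioned above, the following sketch applies simple time-series perturbations (Gaussian jitter, amplitude scaling, circular time shift) to an ECG or PPG segment. These transformations are generic choices assumed for illustration, not augmentation methods prescribed for the A2ES dataset.

```python
# Sketch of simple augmentation for one ECG/PPG segment; the specific
# transformations and parameters are assumptions, not an A2ES recipe.
import numpy as np

rng = np.random.default_rng(42)

def jitter(segment, sigma=0.01):
    """Add low-amplitude Gaussian noise."""
    return segment + rng.normal(0.0, sigma, size=segment.shape)

def scale(segment, low=0.9, high=1.1):
    """Randomly scale the overall amplitude."""
    return segment * rng.uniform(low, high)

def shift(segment, max_fraction=0.1):
    """Circularly shift the segment in time."""
    max_offset = int(len(segment) * max_fraction)
    offset = rng.integers(-max_offset, max_offset + 1)
    return np.roll(segment, offset)

def augment(segment, n_copies=5):
    """Generate n_copies perturbed versions of a single segment."""
    return [shift(scale(jitter(segment))) for _ in range(n_copies)]
```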