Datasets for Automated Affect and Emotion Recognition from Cardiovascular Signals Using Artificial Intelligence— A Systematic Review

Simple Summary

We reviewed the literature on publicly available datasets used to automatically recognise emotion and affect with artificial intelligence (AI) techniques. We were particularly interested in databases containing cardiovascular (CV) data. Additionally, we assessed the quality of the included papers. We searched the sources up to 31 August 2020. Each step of identification was carried out independently by two reviewers to maintain the credibility of our review; disagreements were resolved by discussion. Each action was first planned and described in a protocol that we posted on the Open Science Framework (OSF) platform. We selected 18 works focused on providing datasets of CV signals for automated affect and emotion recognition. In total, data for 812 participants aged 17 to 47 were analysed. The most frequently recorded signal was electrocardiography, and the authors most often used video stimulation. Noticeably, much necessary information was missing from many of the works, resulting in mainly low quality ratings among the included papers. Researchers in this field should focus more on how they carry out and report experiments.

Abstract

Our review aimed to assess the current state and quality of publicly available datasets used for automated affect and emotion recognition (AAER) with artificial intelligence (AI), with an emphasis on cardiovascular (CV) signals. The quality of such datasets is essential for creating replicable systems on which future work can build. We investigated nine sources up to 31 August 2020, using a developed search strategy, and included studies considering the use of AI in AAER based on CV signals. Two independent reviewers performed the screening of identified records, full-text assessment, data extraction, and credibility assessment. All discrepancies were resolved by discussion. We synthesised the results descriptively and assessed their credibility. The protocol was registered on the Open Science Framework (OSF) platform. Out of 4649 identified records, 195 met the eligibility criteria, and 18 of these, focusing on datasets containing CV signals for AAER, were selected. The included papers analysed and shared data of 812 participants aged 17 to 47. Electrocardiography was the most explored signal (83.33% of datasets). Authors utilised video stimulation most frequently (52.38% of experiments). Despite these results, much information was not reported by researchers, and the quality of the analysed papers was mainly low. Researchers in the field should concentrate more on methodology.


Introduction
Facilitating access to databases seems to be an essential matter in the field of machine learning (ML). Publicly available, reliable datasets could drive research forward, making it unnecessary to re-run similar yet complicated experiments in order to obtain sufficient data. Credible work relies on proper arrangement, validation, adjustment, and fairness in artificial intelligence (AI) [1,2].
Moreover, providing sufficient descriptions of scientific methods in AI is a constant challenge. It seems particularly relevant in automated affect and emotion recognition (AAER) studies, which fall under the field of human-computer interaction (HCI), linking psychology, computer science, and biomedical engineering. As human emotions manifest through multiple channels, research on this topic is conducted using speech, facial expressions, gestures, or physiological signals; the latter became exceptionally popular in the last decade [3].
Increasing interest in the field comes, among other things, from broad application prospects. Recent studies point out the potential usage of emotion recognition techniques in medical fields, public security, traffic safety, housekeeping, and related service fields [4].
The topic is extensive, as it covers both data acquisition and computation. A typical experiment in AAER involves several steps [5]. Firstly, the researchers need to adopt a specific theoretical perspective on the field, as many exist that consider the universality of emotions [6,7] or their structure [8]. The theoretical approach imposes an understanding of emotions, the selection of the material used for stimulation, and the interpretation of results. However, the general structure of the elicitation experiments carried out to gather data from human participants remains stable [9]. To evoke emotions, passive (e.g., video, music, or picture presentation) or active stimulation (e.g., game playing, interaction with virtual reality, or conversation) is used [5]. The eliciting material may differ in length, type, and quantity. After the stimulation phase, the subjects are asked how they felt. Several validated instruments exist for this purpose, e.g., the Self-Assessment Manikin (SAM) [10].
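To make this general structure concrete, the following minimal Python sketch outlines one elicitation trial (stimulation, physiological acquisition, then self-assessment). All names and the callback interface are our hypothetical illustrations, not taken from any of the reviewed studies.

```python
from dataclasses import dataclass, field

@dataclass
class ElicitationTrial:
    """One stimulus presentation in a hypothetical elicitation session."""
    stimulus_id: str           # e.g., a clip from a standardised database
    stimulus_type: str         # "video", "music", "picture", "game", ...
    duration_s: float
    ecg: list = field(default_factory=list)  # samples recorded during stimulation
    sam: dict = field(default_factory=dict)  # SAM self-report, each scale rated 1..9

def run_trial(trial, present, record, ask_sam):
    present(trial.stimulus_id)             # stimulation phase (passive or active)
    trial.ecg = record(trial.duration_s)   # physiological acquisition, e.g., ECG
    trial.sam = ask_sam()                  # e.g., {"valence": 7, "arousal": 3, "dominance": 5}
    return trial
```

A full session would loop over many such trials, typically with rest periods between stimuli so that the physiological response can return to baseline.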
As the data collection process in experiments within this field is complex and multistage, the problems may occur on many levels. It is thus crucial to plan the experiment and report upon it in adequate detail [32].
The replicability crisis in both psychology and computer science also affects studies on AAER [5,33,34]. Poor methodological conduct often makes it impossible for existing research to be replicated or reproduced. The phenomenon is widely present even in renowned and well-established research that shapes the social order [35,36].
Datasets collected inadequately might contribute to lowering the credibility of emerging research (influencing model development by introducing undesirable biases) and to wasting time and resources. This issue has been widely discussed before and is known as the garbage in, garbage out problem [37,38]. Avoiding bias and properly validating experiments are crucial to eliminating it [32].
Promisingly, publishing source code and data is becoming a desirable standard in computer science [39][40][41][42]. Journal initiatives [43,44] on the topic emphasise the importance of computational research reproducibility and promote open research. In turn, preregistering the research plan, stating the hypotheses, and defining the methodology step-by-step improve the quality of research and its reproducibility from a psychological perspective [45,46].
To create a reliable model presenting a high degree of emotion or affect recognition precision, it is relevant to limit external and internal factors potentially confounding the collected measurements [47,48].
The confounding effect of incomplete control may arise at any stage of the study. For instance, the measured features might be affected by subjects' somatic disorders, mood disorders, or alexithymia, which is estimated to affect 13% of the population [49].
Each stage of an experiment leading to AAER should be repeatable and standardised among subjects. This concerns stimuli presentation, assessment of the elicited emotions by the subject, collection of physiological parameters, and the laboratory environment, including the presence of the experimenter and individual factors [5].
While measuring emotional and affective responses in the laboratory environment using objective methods reduces the risk of self-reported bias, the risk of contextual nonintegrity remains. This creates the need to document all the contextual environmental aspects that could influence the measurement [50].
With the pervasiveness of wearable devices able to register users' physiological parameters during daily activities, AAER in everyday life is within reach [51,52]. Wearable devices have been shown to measure CV signals efficiently while being offered at low prices [53,54]. However, the challenge remains to design credible ML models able to deal with the broad spectrum of possible emotions and the lack of universality of this category across cultures [7].
Studies on ubiquitous computing are growing in number [55][56][57]. Due to constraints of time and human resources, it is impossible to read all of these results. Therefore, creating summaries along with an analysis of the evidence is now necessary [58]. Describing the data together with a critical appraisal helps to determine, for example, the actual accuracy of the methods and to highlight those articles whose results are derived from a high-quality methodological process. The selection of studies answering a similar research question may be chaotic, purposeful, or systematic [59]. The latter method reduces the risk of researchers steering the conclusions, as it follows restrictive, transparent criteria [32,60,61].
Because of the above, and since previous similar studies on AAER were of weak reliability [5], we decided to present a systematic review on the topic, conducted according to approved standards, to limit the risk of bias (RoB). We review public datasets available for AAER with the use of AI, utilising physiological modalities as input, with a focus on CV signals. This paper is part of a project on a systematic review of studies focused on AAER from CV signals with AI methods. For more details, see the protocol [62] and our previous conference paper [63].

We formulated the following research questions:

1. What are the datasets used for AAER from CV signals with AI techniques?
2. What are the CV signals most often gathered in datasets for AAER?
3. What other signals were collected in the analysed papers?
4. What are the characteristics of the population in the included studies?
5. What instruments were used to assess emotion and affect in the included papers?
6. What confounders were taken into account in the analysed papers?
7. What devices were used to collect the signals in the included studies?
8. What stimuli are most often used for preparing datasets for AAER from CV signals?
9. What are the characteristics of the investigated stimuli?
10. What is the credibility of the included studies?

Eligibility Criteria, Protocol
We considered eligible any type of publication in which CV signals and AI methods were used for AAER. Papers in which more than half of the sample constituted a specific population (e.g., children or people with illness) were excluded. All experiments needed to be carried out in laboratory settings. The primary focus of our whole project [63] was the performance of these computer programs (e.g., specificity, sensitivity, accuracy). For this focused systematic review, we imposed an additional inclusion criterion, namely public availability of the data.
Due to double referencing, some references overlapped: these were post-conference books and full proceedings. We excluded such records as they contained little information about specific chapters; nevertheless, we did not reject the individual chapters themselves. In total, we excluded introductions to Special Issues of a journal or section, letters to editors, reviews, post-conference books, full proceedings (but not the qualified papers within them), and case studies.
The review protocol was published on the Open Science Framework (OSF) [64] and then registered there [62] on 18 March 2021. All additional information about methods can be found in the protocol.

Search Methods
We searched article databases (MEDLINE, Web of Science, dblp, EMBASE, Scopus, IEEE, Cochrane Library) and preprint databases (medRxiv, arXiv). The complete search was done on 31 August 2020.
To develop the MEDLINE strategy (see protocol on OSF [62]), we combined MeSH (controlled vocabulary) and free-text words related to AAER, CV signals, and AI. Then, these strings were translated for other sources utilised in the search. We adopted no date or language restrictions.
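As a rough illustration of such a combined strategy, a simplified MEDLINE-style fragment might look like the hypothetical string below. This is not the registered strategy; the full search strings are available in the protocol on OSF [62].

```python
# Hypothetical, simplified fragment combining a MeSH heading with free-text
# terms for AAER, CV signals, and AI; the registered strategy differs.
medline_fragment = (
    '("Emotions"[Mesh] OR "emotion recognition" OR "affect recognition") '
    'AND (electrocardiograph* OR "heart rate variability" OR photoplethysmograph*) '
    'AND ("machine learning" OR "deep learning" OR "artificial intelligence")'
)
```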
Additionally, we screened full texts of included papers for otherwise not identified studies. We included them in further steps of identification.

Definitions
We used the following definitions. AAER [65,66] refers to finding patterns in specific signals (e.g., behavioural, physiological) consistent with detected states. AI refers to software able to perform tasks as accurately as intelligent beings (e.g., humans) [67]. Deep learning (DL) refers to neural network architectures comprising at least two hidden layers [68]. Performance metrics refer to the mathematical evaluation of model predictions against ground truth [69]. CV signals refer to the electrocardiogram (ECG), pulse oximetry (POX), heart rate (HR), intracranial pressure (ICP), pulse pressure variation (PPV), heart rate variability (HRV), photoplethysmogram (PPG), blood volume pulse (BVP), and arterial blood pressure (ABP) [53,70].
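To make the notion of performance metrics concrete, here is a minimal Python sketch that compares binary predictions with ground truth and computes accuracy, sensitivity, and specificity; the function name and the example labels are ours, for illustration only.

```python
def binary_metrics(y_true, y_pred):
    """Evaluate model predictions against ground truth (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# e.g., detecting "high arousal" (1) vs. "low arousal" (0)
print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
# {'accuracy': 0.8, 'sensitivity': 0.666..., 'specificity': 1.0}
```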

Data Collection
EndNote (Clarivate Analytics®) and Rayyan [71] were utilised for the deduplication of identified references. P.J., D.S., M.S., and M.M. used the Rayyan [71] application to screen the remaining references independently. Subsequently, full texts were assessed separately by P.J., D.S., M.S., and M.M. for meeting the inclusion criteria. P.J., D.S., M.S., M.M., W.Ż., and M.W.G. collected all necessary data independently using a pre-specified extraction form. We gathered bibliographic data (e.g., year, journal name) and information about authors, funding, and conflicts of interest. We also focused on the population, models, and outcomes (AI methods), as well as additional analyses, e.g., interpretability, as specified in the protocol (see OSF [62]).
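A minimal sketch of how such a pre-specified extraction form could be encoded as a data structure is given below; the field names are illustrative, derived from the categories listed above, and do not reproduce the actual form (see the protocol on OSF [62]).

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """Illustrative subset of fields from a pre-specified extraction form."""
    year: int                  # bibliographic data
    journal: str
    authors: str
    funding: str
    conflicts_of_interest: str
    population: str            # e.g., sample size, age range
    ai_methods: str            # models used for validation
    outcomes: str              # e.g., reported performance metrics
    additional_analyses: str   # e.g., interpretability
```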
Pilot exercises were conducted before each phase, namely the screening of abstracts and titles, full-text evaluation, and data extraction. By doing so, we aimed to improve common understanding among the reviewers. When discrepancies occurred (at any step of identification), they were resolved via discussion.

Quality Assessment
The methodological credibility of the included studies was assessed using a tool developed by our team (see Appendix C). The method was based on well-grounded techniques, namely the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) [72], the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [73], and an instrument provided by Benton et al. [74], as it was dedicated to the same study design as ours. The process of evaluation was preceded by pilot exercises. We rated RoB independently in pairs (P.J., D.S., M.S., M.M., W.Ż., and M.W.G.). All discrepancies were resolved by discussion.
The utilised tool consisted of eight questions (items):

1. Was the sample size pre-specified?
2. Were eligibility criteria for the experiment provided?
3. Were all inclusions and exclusions of the study participants appropriate?
4. Was the measurement of the exposition clearly stated?
5. Was the measurement of the outcome clearly stated?
6. Did all participants receive a reference standard?
7. Did participants receive the same reference standard?
8. Were the confounders measured?
Items were assessed using a three-point scale with the following answers: yes/partial yes, no/partial no, and not reported, resulting in high, low, or unclear RoB. For more details, see Appendix C.
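As an illustration of how the item-level answers roll up into an overall rating in both scenarios (the rule itself is given at the end of Appendix C), consider the following minimal Python sketch; encoding the answers as strings is our assumption.

```python
def overall_quality(judgements):
    """Aggregate item-level answers into overall quality, per Appendix C:
    'low' if any item is 'no/partial no'; 'unclear' if no item is
    'no/partial no' but at least one is 'not reported'; 'high' otherwise."""
    if "no/partial no" in judgements:
        return "low"
    if "not reported" in judgements:
        return "unclear"
    return "high"

# Hypothetical study: only item 1 (sample size) is not reported.
answers = ["not reported"] + ["yes/partial yes"] * 7
print(overall_quality(answers))       # Scenario 1 (all items): unclear
print(overall_quality(answers[1:]))   # Scenario 2 (without item 1): high
```

This also shows why two scenarios were analysed: since no included study reported pre-specifying its sample size, item 1 alone would otherwise prevent any study from being rated high overall.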

Analyses
We concentrated on a descriptive synthesis of the characteristics of the populations and collected datasets, i.e., stimuli, signals, devices, emotions, and affect. We also present results regarding credibility.
The quantitative summary with sensitivity, heterogeneity, and subgroup analysis of all papers is not the purpose of this focused review. For more details, please refer to the protocol [62] and other papers from the project [63].

Results
From 4649 records, we identified 195 studies that met our eligibility criteria. Then, we selected a sub-sample of 18 papers. Each paper provides one validated, publicly available dataset, including CV signals with labels regarding emotions or affect.
Supplementary File S1 (OSF [64]) and Appendices A and B contain the list of all included studies, the subgroup of datasets analysed in this review, and the excluded studies with reasons, respectively. The remaining included studies are considered in other articles from the project [63]. The flow of our study is presented in Figure 1. Our reporting is consistent with Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines with diagnostic test accuracy (DTA) extension [93].
None of the studies provided the source code of the executed analyses, while only one study (5.56%) reported registering a protocol [92].
All of the datasets were validated in classification experiments. Authors of only two datasets (11.11%) [86,89] compared their results with other publicly available data.

Population
The total number of analysed participants was 916, with a mean of 43.62 participants per experiment and a range from 3 [80] to 250 [91]. However, due to, e.g., missing data, the datasets contain complete information for only 812 of them.

Credibility
The general RoB was analysed in two scenarios: with or without the first item of the proposed tool (see Section 2.5). We excluded the first question in the second scenario because none of the included studies reported on the pre-specification of sample size.
All ratings are presented in Table 4. Among the 144 item-level judgements (18 studies × 8 items), the most frequent answer was yes, given in 43.06% of cases (62 judgements). However, the second most prevalent was not reported, given in 27.08% of cases (39 judgements).

Additional Analyses
Please refer to the protocol [62] and our other papers [63] from the project on AAER from CV signals with AI methods for additional analyses.

Discussion
The paper search conducted in this study revealed 18 publicly available, validated datasets for AAER from CV signals. The methodological credibility assessment showed that only two studies are of high quality, suggesting a significant need for developing good scientific practices. Furthermore, none of the studies provided the source code used for the validation experiments. This feeds into the discussion on replicability that we are witnessing in science nowadays [5]. Experiments in the included papers were conducted on small samples; only one study exceeded one hundred participants.
What is more, the subjects' background information was poorly described. Only four studies established that the participants were either Chinese or European. According to Wierzbicka [94,95], a person's background (and the language they speak) may play a crucial role in the emotional states they experience and thus should be controlled for. Feldman later disseminated this belief in her approach [7].
Another troubling aspect of the analysis is that an ethics commission approved the experiments described in only four papers, and only one study mentioned ensuring the privacy of participants. This raises red flags in terms of maintaining ethical standards, or suggests negligence in reporting crucial information. Authors of experimental studies should examine this aspect more carefully.
Additionally, the authors either selectively controlled the influence of potential confounders or did not do so at all. Various CV diseases, mental disorders [49], and participants' moods and personalities may affect AAER from physiological signals [78]. Therefore, we believe authors should include such information.
The problem in assessing quality in systematic reviews lies in distinguishing to what extent the authors did not maintain methodological rigour and to what extent they simply did not report the details of the research process [60]. Therefore, it is recommended that, when submitting an article to a journal's editorial office, the authors fill in a checklist and mark the exact places where they have included the minimum necessary descriptions of the research process [32,96].
On the other hand, we observed great diversity in the choice of physiological signals and stimulus types and lengths. What is more, 38.89% of the studies used wearable devices to perform measurements. Considering the increasing popularity and accessibility of these instruments [78], this gives excellent potential for future adoption of the proposed methods in real-life scenarios. Thanks to recent advances in sensor technology, such devices are well-suited for daily usage: they do not require complicated installation, are comfortable to wear, and are easy to use [97]. However, one should remember that there are still many limitations standing in the way of the wider use of wearable devices in AAER. First of all, the quality of physiological signals is still noticeably lower than that of medical-grade equipment [98]. What is more, the data gathered by such instruments in non-laboratory settings are often flawed, with noise coming from motion or misplacement [99].
Similarly to our study, CV databases were also explored by Merone et al. [100]. The authors investigated 12 datasets with the inclusion criterion of having an ECG signal. In addition, they analysed the included sets in terms of many parameters, e.g., the number of ECG channels and the electrode type. However, they did not primarily focus on emotions or affect. They included only one paper [101] covering this scope, which we did not consider eligible for inclusion as it did not meet our criteria. Since datasets including CV signals are still largely unexplored, we cannot compare our results with those of other authors. Furthermore, Hong et al. [102] systematically analysed ECG data using DL. Still, they identified only one study about AAER [103], but it was not of primary interest to them, so they did not describe it in detail.
In line with these results, in the current literature, we found a shortage of highly credible and methodologically reliable publications and thus datasets that could form the basis of further AI research. This review shows a need to create guideline-compliant datasets with a transparent, fully reported methodology and limited RoB.
Models able to accurately recognise emotions using physiological parameters can contribute to the development of many disciplines. They create the possibility of reaching more advanced levels of HCI, where a computer (or system, in general) can modify its behaviour depending on the identified interlocutor's state and choose the reaction closest to natural social schemes [112].
While using wearable devices, users might be supported in maintaining a psychological and healthy life balance, e.g., by identifying sources of stress, anxiety, or tension during their everyday activities and receiving feedback about their bodies' reactions and resources [113]. Furthermore, assessments made on the basis of their CV signals can be used to investigate the impact of different emotional and affective states on the risk of developing CV diseases [114].
Well-validated AI models can significantly support research in the field of health and medical sciences and emotion theory by facilitating the simple, quick, and more matter-of-fact evaluation of emotions and other states and, therefore, reducing the RoB resulting from participants' incorrect reporting.
Among the implications of our study, we should first include the recommendation to incorporate current, reliable guidelines and standards in the methodology development process and use quality assessment and reporting tools, as this translates into more reliable data, which may result in developing better recognition models [32]. For primary studies, we suggest following the proposed checklist for RoB (see Appendix C) or other available tools, e.g., [32].

Strengths and Limitations
The performed review adheres to high standards [32,60,61,115]. The research question was precisely defined. We utilised multiple resources for collecting studies, as mentioned in Section 2. Inclusion and exclusion criteria were first discussed and recorded. The researchers who participated in this review have knowledge of multiple disciplines: computer science, psychology, HCI, medicine, and methodology. To ensure transparency, we provide all necessary information in the Appendices and Supplements with a permanent DOI [64].
On the other hand, we did not search any Chinese databases. Considering the growing amount of evidence in this language, we may have omitted a large body of evidence, thus weakening our conclusions. Moreover, the search strategy itself and the stages of identifying articles based on titles and abstracts may be a limitation: we may have missed an extraordinary piece of work that did not meet our criteria because of its unconventional form.

Conclusions
This paper systematically reviewed the datasets that include CV signals for AAER with AI methods and assessed their quality.
However, due to poor reporting and failure to follow methodological guidelines, the evidence is limited. Nevertheless, according to our review, the most up-to-standard research was proposed by Correa et al. [78] and Marin et al. [85].
In the future, more attention should be paid to controlling bias in research to ensure incremental knowledge gain. The quality of papers and reporting needs to be improved in order to propose and develop models that do not introduce biases. Preferably, authors should focus more on methodology and describe their procedures thoroughly. We recommend following standardised reporting guidelines [116].
Our next steps include the synthesis of gathered evidence with other physiological signals. Furthermore, we want to propose our own unbiased dataset for AAER for public use. Based on these data, we plan to improve our affective games [117][118][119].

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

AAER Automated affect and emotion recognition
ABP Arterial blood pressure
AI Artificial intelligence
BVP Blood volume pulse
CV Cardiovascular
DL Deep learning
DTA Diagnostic test accuracy
ECG Electrocardiogram
HCI Human-computer interaction
HR Heart rate
HRV Heart rate variability
ICP Intracranial pressure
ML Machine learning
OSF Open Science Framework
POX Pulse oximetry
PPG Photoplethysmogram
PPV Pulse pressure variation
PRISMA Preferred Reporting Items for Systematic Reviews and Meta-analyses
PROBAST Prediction model Risk Of Bias ASsessment Tool
QUADAS Quality Assessment of Diagnostic Accuracy Studies
RoB Risk of bias
SAM Self-Assessment Manikin

Appendix C. Risk of Bias Tool
The tool comprises eight items. For each item, the domain is based on an instrument provided in the given reference.

1. Was the sample size pre-specified? (domain: Sample [74])
Yes/partial yes: The experiment was preceded by calculating the minimum sample size, and the method used was adequate and well-described.
No/partial no: It is stated that the minimum sample size has not been calculated, or it has been calculated but no details of the method used are provided.
Not reported: No sufficient information is provided in this regard.

2. Were eligibility criteria for the experiment provided? (domain: Sample [74])
Yes/partial yes: The criteria for inclusion in the experiment are specified.
No/partial no: The criteria for inclusion in the experiment were used but not specified in the article.
Not reported: No sufficient information is provided in this regard.

3. Were all inclusions and exclusions of participants appropriate? (domain: Participants [73])
Yes/partial yes: The criteria for inclusion and exclusion are relevant to the aim of the study. Conditions that may affect the participant's state, the collected physiological signals, or the ability to recognise emotions were considered, including cardiovascular and mental disorders.
No/partial no: The established criteria for inclusion and exclusion are irrelevant to the aim of the study.
Not reported: No sufficient information is provided in this regard.

4. Was the measurement of exposition clearly stated? (domain: Measurement [74])
Yes/partial yes: The selection of stimuli is adequately justified in the context of eliciting emotions, e.g., selection from a standardised database or pilot studies.
No/partial no: The selection of stimuli was carried out based on inadequate criteria.
Not reported: No sufficient information is provided in this regard.

5. Was the measurement of outcome clearly stated? (domain: Measurement [74])
Yes/partial yes: The assessment tool used for emotion measurement is described in detail, adequate, and validated.
No/partial no: The assessment tool used for emotion measurement is not described, or the measurement method is inadequate or not validated.
Not reported: No sufficient information is provided in this regard.

6. Did all participants receive a reference standard? (domain: Flow and Timing [72])
Yes/partial yes: Emotions were measured in all participants, and the measurement was performed after each stimulus.
No/partial no: Not all participants had their emotions measured.
Not reported: No sufficient information is provided in this regard.

7. Did participants receive the same reference standard? (domain: Flow and Timing [72])
Yes/partial yes: The same assessment standard was used for all participants who had their emotions measured.
No/partial no: A different assessment standard was used for some of the participants to measure their emotions.
Not reported: No sufficient information is provided in this regard.

8. Were the confounders measured?
Yes/partial yes: Adequate confounding factors were measured, and relevant justification is provided.
No/partial no: The control of confounding factors is not justified, or the measured factors are inadequate.
Not reported: No sufficient information is provided in regard to confounding factors.

Overall quality was rated in two scenarios: Scenario 1 (based on all items) and Scenario 2 (without the judgement of item 1).
High: All judgements are yes or partial yes.
Low: At least one judgement is no or partial no.
Unclear: All judgements are yes or partial yes, with at least one not reported.