User Experience (UX) and Quality of Experience (QoE) both refer to a user's experience with an application, product, or service: UX from the perspective of understanding and interpreting the user's perceptions and responses [1], and QoE based on the degree of the user's delight or annoyance, which amounts to a quality evaluation [2]. Wechsung and De Moor [3] analyzed the differences and similarities between the two concepts. UX comes from human–computer interaction and is considered more human-centered because of the way observations are captured and interpreted, for example with standardized questionnaires such as the System Usability Scale (SUS) [4] or the Self-Assessment Manikin (SAM) [5], and through the analysis of non-functional aspects such as emotions and other affective states. QoE, in contrast, comes from the telecommunications field and is considered more technical because it depends more on technology, partly due to its relation to Quality of Service (QoS). The two concepts retain theoretical differences, but in practice they are converging on similar evaluation mechanisms. This has even led to proposals to consolidate QoE and UX into a broader concept called Quality of User Experience (QUX) [6], which also includes eudaimonic aspects such as the meaningfulness and purpose of use. For this reason, this review included papers under the QoE/UX label regardless of whether their context was one or the other; QUX was not used because it is a construct still under research and definition.
Traditional QoE/UX evaluation mechanisms are subjective by nature because they rely on users' reports and on evaluators' analyses, which are influenced by perception, criteria, and experience, among other personal factors [7]. Several approaches have been proposed to complement subjective techniques with quality ratings or mental states inferred from users' physiological and behavioral data (e.g., [11]). Although research has been done to interpret the mental states of users performing certain activities, even critical ones such as driving, piloting, and air traffic control (e.g., [15]), the relations between these states and the elements of an interface or the interaction mechanisms have yet to be identified and adequately represented.
This paper presents a Systematic Literature Review (SLR) carried out to identify and analyze research related to QoE/UX evaluation in which cognitive states are interpreted from features of the Electroencephalogram (EEG), Galvanic Skin Response (GSR), Electrocardiogram (ECG), and Eye Tracking (ET) (without pupillometry); this includes the machine learning models used, the best results, and the proposed evaluation architectures. Works that analyzed human signal data to search for correlations between cognitive states and QoE/UX metrics were also considered.
The rest of the paper is structured as follows: The next section presents a background of cognitive states and physiological and behavioral data. Section 3 presents the characteristics of the systematic review protocol. Section 4 describes the final set of articles according to the related topics. Section 5 gives the discussion and findings, and the last section provides the conclusions obtained.
3. Materials and Methods
A systematic literature review is a methodology to identify, evaluate, and interpret relevant research on a particular topic, responding to specific research questions through a replicable and verifiable process [31]. In this review, the recommendations for individual researchers proposed by Kitchenham and Charters [31] were followed, and the SLR protocol and the results were submitted to the supervisors of the research work for criticism and revision. Furthermore, this article was structured according to the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement [32].
3.1. Eligibility Criteria
For the purposes of the review, papers had to be written in English and published between 2014 and 2019. Additionally, the following exclusion criteria were defined:
papers outside the QoE/UX context;
papers recognizing only emotions of the traditional circumplex model of affect [33];
papers involving only signal data outside the research scope (fNIRS, fMRI, pupillometry, facial expressions, etc.);
papers involving experiments only with disorder-diagnosed participants, for example: autism spectrum disorder.
This review represents an initial effort to develop a QoE/UX evaluation architecture based on the interpretation of users' cognitive states. The exclusion criteria were mainly constrained by the research scope (context, mental states, signals, and potential users), considering the equipment and current conditions of our laboratory and the time constraints of the review, among other factors.
The inclusion criteria required that the papers recognize one or more cognitive states from at least one physiological or behavioral signal, including papers on correlations between such data and QoE/UX metrics or papers related to evaluation architectures.
3.2. Search Strategy
The information sources were: Web of Science, ScienceDirect, SpringerLink, IEEE Xplore, the ACM Digital Library, arXiv, PubMed, and Semantic Scholar. The queries were executed in November 2019.
Four search queries were built with different combinations of keywords taken from four main groups: cognitive states, data from various signals, machine learning, and user experience (Table 1). The keywords within each group were connected with the OR operator and the groups with the AND operator; the four group combinations for the search queries were as follows (a sketch of the query assembly appears after the list):
cognitive states AND data AND machine learning AND user experience;
cognitive states AND data AND user experience;
cognitive states AND user experience;
cognitive states AND data AND machine learning.
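As an illustration, such Boolean strings can be assembled programmatically. The following is a minimal Python sketch, assuming keyword groups like those in Table 1; the keywords shown are placeholders, not the complete groups used in the review.

```python
# Hypothetical keyword groups standing in for Table 1 (placeholders only).
GROUPS = {
    "cognitive states": ["mental workload", "engagement", "attention"],
    "data": ["EEG", "GSR", "ECG", "eye tracking"],
    "machine learning": ["machine learning", "classification"],
    "user experience": ["user experience", "quality of experience"],
}

def or_block(keywords):
    """Join a keyword group with OR, quoting exact phrases."""
    return "(" + " OR ".join(f'"{k}"' for k in keywords) + ")"

def build_query(group_names):
    """Connect the selected groups with AND."""
    return " AND ".join(or_block(GROUPS[g]) for g in group_names)

# The four combinations listed above:
combinations = [
    ["cognitive states", "data", "machine learning", "user experience"],
    ["cognitive states", "data", "user experience"],
    ["cognitive states", "user experience"],
    ["cognitive states", "data", "machine learning"],
]
for c in combinations:
    print(build_query(c))
```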
The last query was not performed in Web of Science due to problems with institutional access to the repository. In Semantic Scholar, issues with exact-phrase filters were observed; consequently, only the first query was carried out there. ScienceDirect allows a maximum of eight Boolean connectors per query, so only the most representative keywords of each group were used for that source.
3.3. Study Selection
The papers resulting from each query were analyzed through a three-stage process: (1) duplicate check; (2) evaluation of the exclusion criteria based on the title, abstract, and keywords; and (3) evaluation of the eligibility criteria based on the full text. This process was carried out individually, not peer-reviewed; only the results were reviewed by the supervisors of the research work.
The papers that did not meet the eligibility criteria were recorded and labeled as discarded. The papers that passed Stage (3) were tagged as considered and stored using the Mendeley Desktop reference management software.
As shown in Figure 1, a total of 858 records were initially identified. Later, 276 duplicates were removed, and 553 records were discarded because they did not meet the eligibility criteria, leaving 29 papers for detailed analysis and data extraction.
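The staged flow can be pictured with a minimal sketch, assuming each record is a dictionary with title and abstract fields; the records, the deduplication heuristic, and the screening predicate are hypothetical and do not reproduce the counts of Figure 1.

```python
# Staged selection sketch with hypothetical records (not the reviewed set).
def deduplicate(records):
    """Drop duplicates by normalized title, a common first-pass heuristic."""
    seen, unique = set(), []
    for r in records:
        key = r["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

records = [
    {"title": "EEG-based workload study", "abstract": "QoE evaluation with EEG"},
    {"title": "EEG-Based Workload Study", "abstract": "QoE evaluation with EEG"},
    {"title": "fMRI emotion paper", "abstract": "outside the review scope"},
]
stage1 = deduplicate(records)                           # (1) duplicate check
stage2 = [r for r in stage1 if "QoE" in r["abstract"]]  # (2) title/abstract screen
# Stage (3), full-text eligibility, would filter stage2 further.
print(len(records), len(stage1), len(stage2))           # 3 2 1
```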
3.4. Data Extraction
Different data were extracted from the final selection of papers: general data (e.g., authors and institutions of origin, name of the journal or conference), experiment data (e.g., number and characteristics of participants, stimulus, cognitive states, equipment, signals), data related to the classification models (e.g., types of machine learning models, features extracted from the signals, performance), data related to QoE/UX evaluation architectures (e.g., modules, proposed layers, representation of results), and data related to the obtained results (e.g., findings, conclusions). Records were initially kept in a spreadsheet and later in the Notion software.
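For illustration, such a record can be represented as a structured type. The sketch below paraphrases the categories listed above into hypothetical field names; it is not the actual extraction form used in the review.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    # General data
    authors: list
    venue: str                      # journal or conference name
    # Experiment data
    n_participants: int
    stimulus: str
    cognitive_states: list          # e.g., ["mental workload", "attention"]
    signals: list                   # e.g., ["EEG", "GSR", "ET"]
    # Classification-model data
    models: list = field(default_factory=list)
    features: list = field(default_factory=list)
    performance: dict = field(default_factory=dict)  # e.g., {"accuracy": 0.92}
    # Architecture and results
    architecture_notes: str = ""
    findings: str = ""
```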
In this review, we identified 29 papers within the context of QoE/UX evaluation related to the recognition of cognitive states and published between 2014 and 2019.
Experiments with different signals and numbers of participants were identified: EEG data from as few as 4 participants [40], ET data from up to 136 participants [39], and data acquired from several signals from up to 61 participants [34]. Figure 3 shows the distribution of the number of participants in experiments that collected EEG, ECG, GSR, or ET data, with medians of 20, 42.5, 24, and 33.5, respectively. If more than one signal was used in an experiment, each signal was counted independently. Outliers were observed for EEG and ET, indicating that a high number of participants is uncommon with these signals.
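As a worked illustration of how the medians in Figure 3 arise, the sketch below computes per-signal medians over hypothetical participant counts; the counts are placeholders chosen only so that the medians match the values reported above, not the counts of the reviewed experiments.

```python
import statistics

# Hypothetical per-experiment participant counts (placeholders only).
participants = {
    "EEG": [4, 12, 20, 28, 136],
    "ECG": [30, 42, 43, 61],
    "GSR": [15, 24, 40],
    "ET":  [20, 33, 34, 136],
}
for signal, counts in participants.items():
    print(signal, statistics.median(counts))   # 20, 42.5, 24, 33.5
```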
None of the QoE/UX approaches that address the recognition of cognitive states from physiological and behavioral data use deep learning models in any part of the process. Good results have been observed in other contexts with autoencoder architectures (e.g., [63]) and convolutional architectures (e.g., [65]); however, this can be difficult when the number of participants in the experiments is small, since deep learning models require a significant amount of data to realize their potential [67]. Only two of the investigations [37] considered techniques such as SMOTE or the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN) [68] for data augmentation and class balancing. The use of other techniques or models to generate synthetic data, such as those based on Generative Adversarial Nets (GANs) [69], which are being studied and evaluated in other contexts (e.g., [70]), was not identified.
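As a minimal sketch of class balancing with the techniques named above, the following uses the SMOTE and ADASYN implementations from the imbalanced-learn library on a synthetic, imbalanced feature matrix; the features and labels are placeholders, not data from any reviewed study.

```python
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))        # e.g., 8 hypothetical EEG band-power features
y = np.array([0] * 45 + [1] * 15)   # imbalanced cognitive-state labels

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print(np.bincount(y_sm), np.bincount(y_ad))   # both classes roughly balanced
```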
In general, the studies do not report the preparation time dedicated to each participant. The number of participants may be limited by the type and number of measuring devices that must be configured. With non-invasive EEG devices, in the form of a headband or cap, a greater number of electrodes can imply more placement and calibration time per participant. With ET devices, the calibration time is usually shorter, although the lighting conditions of the environment must be taken into account to a greater extent. For cardiac activity monitoring, ECG provides a large amount of information with high precision, with electrodes usually placed on the chest or arms; the disadvantage is that these sensors are more intrusive and their installation requires a stricter protocol than that of devices that measure heart rate through photoplethysmography (PPG). For GSR, sensors are usually placed on the arms, fingers, or forehead and require little preparation time.
To properly select the type and quantity of measuring devices used in QoE/UX evaluations, Zeagler's recommendations for wearable devices [72] and those of Erins et al. [73] in the context of fatigue detection can be taken into account: intrusiveness and interference with the task must be minimal, which requires considering aspects such as perceived weight, user movement, acceptability, sensor mobility and availability, and susceptibility to the environment, among others. Even before choosing the sensors, it is necessary to evaluate the convenience of measuring the proposed set of cognitive states in a given application; for this, the attributes contributed by Charlton [74], related to sensitivity, intrusion, diagnosis, convenience of measurement, relevance, transferability, and acceptance, can be considered as a starting point.
In the experiments, the age and sex of the participants were reported, but no conclusions related to these aspects were presented. It has been observed that individual differences arising from factors such as demographics or experience in the task can influence physiological and behavioral signals [75]; however, few studies consider these factors (e.g., [76]). Figure 4 shows the proportion of participants of each sex across all the experiments related to each signal: male participants were the majority for EEG, ECG, and GSR, with a more equitable distribution for ET; the average difference between the sexes was 29% for EEG, 36% for ECG, 24% for GSR, and 21% for ET. This reaffirms what was found in [62]: standardized experiments are not performed, and the lack of uniformity makes it difficult to compare results.
On the other hand, we identified that the generated datasets are not available for later tests or validations; in this sense, the requirements presented by Mahesh et al. [77] can be generalized to build reference datasets.
The research related to the classification of cognitive states covered the following states: mental workload [34], engagement [35], confusion [37], attention [36], and mental stress [22]. Table 2 shows the machine learning models with the best performance. Although accuracies above 90% have been obtained (e.g., [34]), the classification is based on interpreting the user's cognitive state in response to the stimulus as a whole, without studying its relation to specific elements of the interface or the interaction mechanisms when using an application; this adds to the difficulty of understanding how these states relate to the user's perception of quality given the characteristics of, and changes in, the stimulus.
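For illustration, the typical pipeline behind such results can be sketched as a classifier over hand-crafted signal features with cross-validated accuracy; the scikit-learn model choice and the synthetic features below are assumptions for the example, not the setup of any reviewed paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 10))      # placeholder features (e.g., band powers)
y = rng.integers(0, 2, size=120)    # e.g., low vs. high mental workload

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(f"mean accuracy: {scores.mean():.2f}")
```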
Several papers identified correlations between different physiological and behavioral signals and aspects such as experience and difficulty in the task [46], performance [48], and perception of quality [50], among others, as well as with cognitive processes [50] and states such as engagement [53], mental workload [55], and attention [56]; however, the usefulness of self-report questionnaires persists and is highlighted, supporting the idea that QoE/UX evaluation mechanisms should adopt mixed approaches that combine standardized questionnaires with the interpretation of physiological and behavioral signals.
The analyzed evaluation architectures considered several types of sensors and the detection of various mental states: Hussain et al. [41] emphasized the features and the independent performance of the models used in each detection module; Courtemanche [42] highlighted the importance of tools that represent users' mental states, their usefulness for the evaluators who interpret them, and the requirements that the industry demands [45]. In general, the architectures define modules or layers for data capture and processing, for analysis and metric calculation, and for the generation and presentation of results; the process starts with the user performing a task and ends with an expert evaluator interpreting the results and generating or complementing a final report with the findings detected in the test.
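The common layered structure described above can be pictured with a minimal sketch; all class and method names below are hypothetical illustrations of the capture/processing/reporting separation, not a design taken from any of the reviewed architectures.

```python
from abc import ABC, abstractmethod

class CaptureModule(ABC):
    """Acquires raw signal windows per sensor while the user performs a task."""
    @abstractmethod
    def acquire(self) -> dict: ...

class ProcessingModule(ABC):
    """Analyzes raw data and computes inferred states and metrics."""
    @abstractmethod
    def metrics(self, raw: dict) -> dict: ...

class ReportingModule(ABC):
    """Generates the results presented to the expert evaluator."""
    @abstractmethod
    def report(self, metrics: dict) -> str: ...

def run_evaluation(capture, processing, reporting):
    """End-to-end flow: capture during the task, process into metrics,
    and produce a report for the evaluator to interpret and complement."""
    raw = capture.acquire()
    return reporting.report(processing.metrics(raw))
```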
The presented review has some limitations. The planning and execution of the search, as well as the selection and analysis of the results, were not carried out under a peer-validation scheme; although the supervisors reviewed and criticized the work, the intrinsic bias of an individual researcher remains. The number of analyzed papers was modest given the restriction to the QoE and UX contexts; the aim was to cover both topics, given their similar evaluation approaches based on physiological and behavioral signals, producing generalized results rather than independently detailed ones.