Advancing Stress Detection Methodology with Deep Learning Techniques Targeting UX Evaluation in AAL Scenarios: Applying Embeddings for Categorical Variables

Abstract: Physiological measurements have been widely used by researchers and practitioners to address the stress detection challenge. So far, various datasets for stress detection have been recorded and are available to the research community for testing and benchmarking. The majority of the available stress-related datasets have been recorded while users were exposed to intense stressors, such as songs, movie clips, major hardware/software failures, image datasets, and gaming scenarios. However, it remains an open research question whether such datasets can be used to create models that will effectively detect stress in different contexts. This paper investigates the performance of the publicly available physiological dataset named WESAD (wearable stress and affect detection) in the context of user experience (UX) evaluation. More specifically, electrodermal activity (EDA) and skin temperature (ST) signals from WESAD were used to train three traditional machine learning classifiers and a simple feed-forward deep learning artificial neural network combining continuous variables and entity embeddings. Regarding the binary classification problem (stress vs. no stress), high accuracy (up to 97.4%) was achieved for both training approaches (deep learning, machine learning). Regarding the stress detection effectiveness of the created models in another context, such as user experience (UX) evaluation, the results were quite impressive. More specifically, the deep-learning model achieved a rather high agreement when a user-annotated dataset was used for validation.


Introduction
The collaborative research efforts of several domains, such as human−computer interaction, ubiquitous computing, ambient intelligence, and the Internet of Things, have significantly raised interest in emotional aspects of software products [1]. Considering typical living environments and commercial off-the-shelf (COTS) sensors and actuators, User eXperience (UX) enables academics and practitioners to gain a deeper understanding of the users' interaction experiences by employing tools and approaches that go beyond traditional usability metrics [2]. UX is a broad term that includes factors such as usability, usefulness, aesthetics, and emotions [3]. In many cases UX design begins before the product is even in the hands of the user, effectively anticipating the end user's needs and requirements. Designing and developing for UX necessitates a thorough grasp of how users feel while interacting with a system or a product [4].
A variety of approaches, such as post-questionnaires, interviews and observation can be used in order to measure the emotional aspects of UX. Alternatively, modalities that can be acquired by embedded and robotic sensors and actuators, such as facial expression [5,6], speech tone [7] and touchscreen patterns analysis [8,9] have been proposed. In the same way, physiological signals monitoring (e.g., heart rate, respiration, skin conductance) through COTS sensing equipment such as electrocardiogram (ECG), oxygen saturation (SpO2), and galvanic skin response (GSR), respectively, is also an approach that has been adopted by researchers in the context of UX evaluation [10].
Targeting an interactive environment such as ambient assisted living (AAL) in a UX evaluation, one is mostly interested in preventing stress events which are related to software issues [11]. Software flaws can cause the undesirable activation of the users' physiology, widely referred to as a "fight or flight" event or stress [12].
Among the typical modalities applied in a wide range of application scenarios, including AAL environments, skin conductivity (SC), also known as galvanic skin response (GSR), is one of the most well-studied psychological markers of the functioning of people's autonomic nervous system. SC is associated with both emotional responses and cognitive activity. Signal characteristics, such as peak height and instantaneous peak rate, are reliable indicators of the stress level of a user. In [13], an extensive summary of SC research in relation to stress is presented. Furthermore, skin temperature (ST) is also a signal that can be easily measured. ST normally ranges between 32 and 35 °C [14] and has been used in numerous studies for emotion detection. In contrast to GSR, there is ambiguity concerning the impact of stress on skin temperature fluctuations. More specifically, some studies confirm that ST rises when experiencing stress [15], while other studies [16,17] argue that ST decreases under stress. In the present study we used both GSR and ST signals during the training process of our stress detection models.
The evolution of miniaturized embedded systems and wearables [18,19] has further favored the use of physiological signals, allowing experiments to take place in more ecologically valid settings [20] at a relatively low cost [21].
There is a large number of publicly available physiological datasets [22][23][24] for stress research that have been emotionally annotated in a context where users have been exposed to the intense stressors typically found in real-life conditions (e.g., movie clips, songs, major hardware/software failures, image datasets, and gaming). Although such approaches are able to create stress prediction models with rather high classification accuracy, it remains questionable if they could be effectively used in capturing subtle stress responses, which are mostly expected in different contexts, such as UX evaluation studies [25].
This paper tries to answer the aforementioned question by conducting an in-depth investigation of the performance of such a dataset in the context of UX evaluation and modalities by combining traditional machine learning and deep-learning approaches. To the best of our knowledge, this is the first study that uses deep learning for stress detection in the UX context. More specifically, the SC and ST signals of 15 users of the publicly available physiological dataset named WESAD (wearable stress and affect detection [26]) were used to train three machine learning classifiers (L-SVM, C-SVM, Q-SVM) and a deep-learning model implemented using the fastai framework [27]. The fastai library has built-in functionality based on neural networks that allows the classification of tabular data with very good results. Tabular data, also known as relational data or structured data, is the most commonly used type of data, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing. The SC and ST signals of the WESAD dataset were preprocessed to extract the appropriate features, as described in detail in Section 3, and the resulting tabular data were provided as input for the stress/no stress problem.
A publicly available emotionally annotated biosignals dataset, made available by [28], was considered as the ground truth dataset in order to evaluate the performance of the aforementioned models in stress detection in the context of UX evaluation. The ground truth dataset consists of SC segments that have been emotionally annotated by users' self-reported ratings (valence-arousal scale). The reported periods indicate usability issues confronted by users while they were interacting with a platform during a UX evaluation study. Using users' self-reporting as ground truth is a common practice in UX research [29][30][31][32]. The rest of the paper is structured as follows: Section 2 presents the WESAD dataset that was used for the training process. Section 3 presents the training of the stress detection models, using both deep learning and traditional machine learning, as well as the training results. In the following section the created models are applied in another context (UX evaluation) and their performance is shown.

Wearable Stress and Affect Detection (WESAD) Dataset
WESAD [26] is a publicly available multimodal physiological dataset proposed for wearable stress and affect detection. The signals were recorded during a lab study in which 15 participants with a mean age of 27.5 years (SD = 2.4) were exposed to three different affective states: neutral, stress, and amusement. Regarding stress, the Trier Social Stress Test (TSST) was employed by the researchers in order to elicit the specific emotion. More specifically, the dataset consists of the following physiological signals: blood volume pulse (BVP), electrocardiogram (ECG), electrodermal activity (EDA), electromyogram (EMG), respiration (RESP), skin temperature (TEMP), and three-axis acceleration (ACC).
Regarding the binary classification problem (stress vs. nonstress), a classification accuracy of up to 93% was reported by authors when all physiological signals participated in the training process. Classification was also conducted by using only electrodermal activity and skin temperature data. In these cases, the accuracy was 80 and 59%, respectively. Towards a better tradeoff between recognition performance and computational load, Liu et al. [33] used a single biosignal in order to create a more practical, unobtrusive and comfortable wearable system for stress detection. In particular, the skin conductivity along with linear discriminant analysis (LDA) were used in order to discriminate three stress levels: low, medium and high. A classification accuracy of 81.82% was achieved. Furthermore, Jussilla et al. [34] proposed an effective stress management biosensor named the "smart ring". The smart ring measures EDA from the palmar side of the wearers. Such approaches might be a promising line of research for the development of practical personal stress monitors. This is also our rationale to use only two signals.
Despite the high classification accuracy in both approaches (all signals vs. EDA), WESAD authors indicate that "results should be interpreted with caution due to the limitations of WESAD, regarding the number of subjects and the lack of age and gender diversity".

Training and Results
In this section, the training process of the stress detection models is presented. The trained models can be divided into two groups: (a) deep learning and (b) traditional machine learning.
While deep learning has revolutionized the processing of unstructured data (e.g., audio, image and natural language), it has received less attention in the processing of structured (or tabular) data, i.e., data organized in the form of a table (e.g., a spreadsheet or database). Instead, structured data problems are more commonly solved with tree-based models such as Random Forest [35] and XGBoost [36], or with SVMs [37]. However, recent breakthroughs in the representation of categorical data with the introduction of entity embeddings [38] have shown that deep learning models can offer significant benefits in regression and classification problems for structured data. Unlike continuous variables, which contain continuous numerical data, categorical variables contain values taken from a fixed set: either numerical data (e.g., age) or numbers that map to string values (e.g., color).

Training a Deep Learning Model by Combining Continuous Variables and Entity Embeddings
In this section, a deep learning model for stress classification is proposed. The proposed model combines the continuous and categorical input variables from the WESAD dataset in a neural network model. More specifically, we leverage the entity embedding technique for the representation of categorical variables, where each fixed value of the variable is represented as a numerical vector, typically with a low dimensionality. This technique maps discrete values to a multidimensional space through a layer of linear neurons. Thus, the relationship between discrete values can be captured in the distance between these vectors, in a similar way to how word embeddings reflect semantic similarity in the NLP domain (e.g., given the categorical variable day, Sunday could be considered more similar to Saturday than it is to Monday).
Our model includes both continuous and categorical variables. Specifically, it includes 21 continuous variables which represent the features that were extracted from the WESAD dataset. The extraction process is described in detail in Section 4. The continuous variables include the mean value of the SC signals (after being smoothed and then normalized as proposed in [39]), the median, the standard deviation and other features as proposed in [40]. The categorical variables include the user's gender, the information if the user is a smoker or not, if he/she smoked in the last hour before the experiment and if the user drank coffee in the last hour before the experiment.
The proposed neural network model for stress classification includes embedding layers for the categorical columns and a batch normalization layer for the continuous columns. The resulting representations are concatenated and fed into two fully connected layers with 200 and 100 nodes, respectively, followed by a dropout layer. A network with two hidden layers is capable of representing an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy [41,42]. The activation function used was ReLU (see Figure 1). Thus, the categorical variables are transformed by the embedding layer before interacting with the continuous input variables. Finally, the output layer predicts the "stress" or "no stress" classes, based on the cross-entropy loss function. This procedure is shown in Figure 1, while the composition of the training and validation sets is detailed in Section 3.2.
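The data flow described above can be illustrated with a minimal, standard-library-only sketch of the forward pass. It is not the fastai model used in the study: the embedding width (`EMB_DIM`), the assumption that every categorical variable is binary, and the random weights are all hypothetical choices made for illustration, and batch normalization and dropout are omitted for brevity.

```python
import math
import random

random.seed(0)

N_CONT = 21                      # continuous features, as stated in the text
CAT_VARS = ["gender", "smoker", "smoked_last_hour", "coffee_last_hour"]
N_LEVELS = 2                     # assumption: each categorical variable is binary
EMB_DIM = 2                      # assumption: the embedding width is not given in the text
HIDDEN = (200, 100)              # fully connected layer sizes from the text
N_CLASSES = 2                    # "stress" vs. "no stress"


def rand_matrix(rows, cols, scale=0.1):
    """Matrix of small random values (stands in for trained weights)."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# One embedding table per categorical variable: row i is the vector for level i.
embeddings = {name: rand_matrix(N_LEVELS, EMB_DIM) for name in CAT_VARS}

in_dim = N_CONT + EMB_DIM * len(CAT_VARS)
sizes = (in_dim,) + HIDDEN + (N_CLASSES,)
weights = [rand_matrix(sizes[i + 1], sizes[i]) for i in range(len(sizes) - 1)]
biases = [[0.0] * sizes[i + 1] for i in range(len(sizes) - 1)]


def dense(x, w, b):
    """Fully connected layer: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def forward(cont, cats):
    """cont: 21 floats; cats: dict mapping variable name -> level index."""
    x = list(cont)
    for name in CAT_VARS:
        x.extend(embeddings[name][cats[name]])   # embedding lookup, then concatenation
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(dense(x, w, b))                 # hidden layers with ReLU
    return softmax(dense(x, weights[-1], biases[-1]))


probs = forward([0.5] * N_CONT, {name: 1 for name in CAT_VARS})
print(probs)  # two class probabilities that sum to 1
```

In training, the embedding rows would be updated by backpropagation together with the dense-layer weights, which is what allows similar category levels to end up close to each other in the embedding space.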

Training Dataset Creation
The nonspecific skin conductance responses (NS-SCRs) identified in the SC signals of the WESAD dataset were used as the pivotal point of analysis in order to create our training dataset.
More specifically, the use of intensive subperiods that might appear within an emotionally annotated period can probably contribute to the final assessment of the experienced emotion (i.e., feeling stressed, happy, angry, etc.). In terms of stress detection, intensive subperiods could be interpreted as NS-SCRs. In the present study, we used only the signals of the stress sessions (TSST) as already mentioned in Section 2.1. To this end, a validated and freely available software tool named PhysiOBS [28,43] was used to detect and extract periods of NS-SCR (see Figure 2). PhysiOBS integrates an appropriate mechanism which can detect and export significant NS-SCRs. Next, as proposed in [13], the NS-SCR segments with a duration longer than or equal to 4 s from the NS-SCR's initial deflection to peak were considered; a rule also applied in [44]. For the same time periods we also extracted the associated parts of the skin temperature signal in the WESAD dataset (see also Figure 2). Both SC and ST signals were smoothed using the Hann function and then normalized as proposed in [39]. Both functionalities are supported by the software.
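As a rough illustration of the smoothing and normalization step, the sketch below applies a Hann-weighted moving average followed by min-max scaling. The window length, the edge handling, and the choice of min-max scaling are assumptions made here for illustration; the study relies on the PhysiOBS implementation and on the normalization proposed in [39], whose details are not given in this section.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def smooth(signal, window_len=5):
    """Hann-weighted moving average; window_len is assumed odd, edges are clamped."""
    w = hann_window(window_len + 2)[1:-1]   # drop the zero-valued end points
    half = window_len // 2
    total = sum(w)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, wk in enumerate(w):
            j = min(max(i + k - half, 0), len(signal) - 1)  # clamp at the borders
            acc += wk * signal[j]
        out.append(acc / total)
    return out


def min_max_normalize(signal):
    """Scale a signal to [0, 1] (assumed normalization; see the lead-in)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0] * len(signal)
    return [(v - lo) / (hi - lo) for v in signal]


# Hypothetical skin conductance samples, for illustration only.
sc = [2.0, 2.1, 2.6, 3.4, 3.1, 2.7, 2.4, 2.3]
clean = min_max_normalize(smooth(sc))
```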
The detected significant NS-SCR segments within each TSST session served as the stress class and the remaining parts of the stress session served as the nonstress class. This dataset creation approach has also been applied in [45].
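The labeling rule can be sketched as follows, assuming that each detected NS-SCR is available as a hypothetical (onset, peak) pair in seconds. How the nonstress remainder is cut into individual segments is not specified in the text, so the sketch simply returns the gaps between retained NS-SCRs.

```python
MIN_RISE_S = 4.0  # threshold from the text: onset-to-peak duration >= 4 s


def label_session(session_len_s, scrs):
    """Split one stress session into (start, end, label) intervals.

    scrs: list of (onset_s, peak_s) pairs, assumed sorted and non-overlapping.
    """
    kept = [(on, pk) for on, pk in scrs if pk - on >= MIN_RISE_S]
    intervals, cursor = [], 0.0
    for on, pk in kept:
        if on > cursor:
            intervals.append((cursor, on, "nonstress"))  # gap before this NS-SCR
        intervals.append((on, pk, "stress"))
        cursor = pk
    if cursor < session_len_s:
        intervals.append((cursor, session_len_s, "nonstress"))
    return intervals


# Hypothetical 60 s session with three detected NS-SCRs; the second is too short.
labeled = label_session(60.0, [(5.0, 10.0), (20.0, 22.0), (30.0, 36.0)])
```

In this example the response between 20 s and 22 s is discarded by the 4 s rule and therefore falls inside a nonstress interval.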

Training and Classification
Regarding the numeric inputs, 21 features were extracted from the SC signal's amplitude as proposed in [40]. Furthermore, 15 features were extracted from the ST signal's amplitude as proposed in [26]. All these features were calculated over the 380 segments extracted from the NS-SCR segments within each TSST session as described in Section 3.2. Figure 3a shows the values of the first feature (i.e., the mean value of the signal's first difference) of the SC signal's amplitude for the 380 segments, and Figure 3b shows the corresponding values for the ST signal's amplitude. It must be noted that 165 of the segments correspond to the stress class and 215 to the nonstress class. It can be seen in Figure 3a that there is a clear separation between the stress and nonstress classes, which implies a higher predictive value than in Figure 3b, where the values of the features are not clearly separated. This is also reflected in Table 1, where the skin conductance signal metrics outperform the corresponding skin temperature metrics.
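A few of the amplitude features named in the text (mean, median, standard deviation, and mean of the first difference) can be computed per segment as in the sketch below. It covers only this small subset; the full 21-feature and 15-feature sets follow [40] and [26] and are not reproduced here, and the sample values are hypothetical.

```python
import statistics


def basic_features(segment):
    """Compute a small subset of amplitude features for one segment (len >= 2)."""
    diffs = [b - a for a, b in zip(segment, segment[1:])]  # first difference
    return {
        "mean": statistics.fmean(segment),
        "median": statistics.median(segment),
        "std": statistics.pstdev(segment),
        "mean_first_diff": statistics.fmean(diffs),
    }


# Hypothetical normalized SC samples from one segment.
features = basic_features([1.0, 2.0, 4.0, 7.0])
```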
As a next step, the extracted features were provided as input to the three machine learning algorithms (C-SVM, L-SVM and Q-SVM) and to the deep learning model aiming to differentiate the two emotional states (stress vs. nonstress).
Regarding the machine learning classification, 5-fold cross-validation was applied in all classification methods. For the binary problem (stress vs. nonstress), Table 1 presents the performance metrics obtained for each trained classifier. All classifiers achieved high accuracies (at least 91%). The best classification result was achieved by the L-SVM classifier (93.2%). These results indicate that our training method improved the classification results compared to the 80% accuracy reported by [26] when using only the SC signal. Furthermore, the confusion matrix (see Figure 4) presents details about the correctly classified cases for each trained model.
Electronics 2021, 10, 1550
Table 1. Performance for each signal (skin conductance: SC, skin temperature: ST) per classifier.
The F1-score is also an important metric when there are imbalanced classes, as in our case. The plot of sensitivity versus 1 − specificity is called the receiver operating characteristic (ROC) curve and the area under this ROC curve is called the area under the curve (AUC) (see Figure 5). Both ROC and AUC are effective measures of accuracy. The ROC curve plays a central role in evaluating the diagnostic ability of tests to discriminate the true state of subjects. The AUC can be interpreted as the probability that a randomly chosen stress signal is rated or ranked as more likely to be stress than a randomly chosen nonstress signal. All classifiers achieved high AUC (at least 94%). The best AUC result was achieved by the L-SVM classifier (98%).
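The metrics discussed above can be made concrete with a short sketch. The AUC function follows the probabilistic interpretation directly by comparing every (stress, nonstress) pair of scores, with ties counted as one half; the labels and scores in the example are hypothetical, and both classes are assumed to be present.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for labels in {0, 1} (1 = stress)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc(y_true, scores):
    """Probability that a random stress sample scores higher than a random nonstress one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]
metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
```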
Regarding deep learning, two versions of the model were tested, with and without categorical variables, to assess their impact on classification accuracy. To increase the model performance, the optimal value of the learning rate hyperparameter was selected, based on its effect on loss, as shown in Figure 6.

Specifically, the learning rate is a hyperparameter that scales the back-propagated gradient in each weight update and thus determines how far each step moves towards a minimum of the loss. If the learning rate is too small, the optimization takes a long time and makes only tiny changes to the weights of the model, so the model converges slowly without real benefit. If the learning rate is too high, the optimizer may overshoot the minimum and may even get worse by diverging. As [46] suggests, we chose a learning rate one order lower than the learning rate where the loss is minimum. Based on this approach, in the case of continuous variables (see Figure 6) the loss is minimum when the learning rate is 7 × 10⁻²; therefore, we used as a starting point the value 7 × 10⁻², where the loss was still decreasing, and after fine-grained experiments we ended up using the value 5 × 10⁻³.
One of the critical issues while training a neural network is overfitting [47]. Although a number of epochs is needed to train a neural network model, with too many epochs the model learns patterns that are specific to the sample data. In other words, the model loses generalization capacity by overfitting to the training data. To avoid overfitting and increase the generalization capacity of the neural network, the model should be trained for an optimal number of epochs. Loss and accuracy on the training set as well as on the validation set are monitored to find the epoch number after which the model starts overfitting. As we can see from Figure 7, as the number of epochs increases beyond 10, the training set loss decreases but the validation loss increases, indicating that the model is overfitting the training data. The ideal number of epochs is therefore the point where the training loss is still decreasing but the validation loss starts increasing; as we can see from Figure 7, the ideal number of epochs for our model (without categorical variables) is 10.
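Both selection rules described above are simple enough to sketch in a few lines. The function names and all loss values below are hypothetical and are not read from Figures 6 and 7; the sketch only shows the "one order lower than the loss minimum" heuristic and the choice of the epoch at which the validation loss is lowest.

```python
def suggest_learning_rate(sweep):
    """sweep: list of (learning_rate, loss) pairs from a learning-rate range test.

    Returns the learning rate at minimum loss divided by 10.
    """
    lr_at_min, _ = min(sweep, key=lambda pair: pair[1])
    return lr_at_min / 10.0


def best_epoch(val_losses):
    """1-based epoch with the lowest validation loss (training beyond it overfits)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


# Hypothetical sweep and validation-loss history, for illustration only.
sweep = [(1e-4, 0.70), (1e-3, 0.55), (1e-2, 0.40), (7e-2, 0.31), (1e-1, 0.45), (1.0, 2.0)]
lr = suggest_learning_rate(sweep)
stop_at = best_epoch([0.90, 0.70, 0.50, 0.45, 0.50, 0.60])
```

The heuristic only gives a starting point; as in the study, the final value is normally refined by further experiments.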
Figure 6. Learning rate chart. Pink rectangle indicates an area of optimal choices. In our case the ~7 × 10⁻² learning rate was used.
Regarding the binary problem (stress vs. nonstress), Table 2 presents the performance metrics obtained for each trained classifier (with and without categorical variables). The best accuracy (97.4%) was achieved by the deep learning model with categorical variables. These results indicate that our training method improved the classification results compared to the machine learning results (92.1%) and outperformed the 80% accuracy reported by [26] when using only the SC signal.

Ground Truth Physiological Dataset: UX Context
This section presents the ground truth physiological dataset that was used to assess the stress detection models created from the WESAD dataset. The UX dataset consists of skin conductivity (SC) segments that have been emotionally annotated with users' self-reported ratings (valence−arousal scale). The recorded SC segments correspond to usability issues encountered by users while they were interacting with a platform in the context of a UX evaluation study [28].
More specifically, the aforementioned study involved 30 participants (13 female), aged between 18 and 45 (mean = 32.1, SD = 7.1), who were asked to complete two interaction tasks on a web-based service while their SC was recorded. At the end of each user testing session, each participant took part in a retrospective think aloud (RTA) protocol in order to report any usability issues (UIs) that they had encountered while performing the interaction tasks. For each retrospectively reported UI, the participant was asked to provide: (a) the duration of the UI and (b) an emotional rating on the valence and arousal scales (each from 1 to 9). In total, 113 emotionally annotated UIs were reported. Each annotated UI had an associated segment of the SC signal; these segments constitute the ground truth biosignals dataset that was used in the present study to test the stress detection models created from the WESAD dataset.

Evaluation of Classifiers
In Section 3 we presented the training process and the results of four classifiers, which were trained on the SC signals from the publicly available WESAD dataset. To measure the performance of the skin conductance trained models, we used the dataset presented in Section 4.1. More specifically, the test dataset consists of 113 user-reported SC segments that were emotionally annotated according to VA ratings. The Kappa coefficient [48] was used to quantify the agreement between the users' emotional ratings and the outputs of the created classifiers. Kappa ranges from −1 to 1. Values near or below zero suggest that any agreement is probably attributable to chance. In contrast, the higher the positive value of Kappa, the higher the reliability.
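As an illustration of how such an agreement score can be computed, the following minimal sketch implements Cohen's kappa for two raters. It assumes that the Kappa coefficient of [48] is Cohen's kappa; the function name `cohen_kappa` and the label sequences are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = stress, 2 = nonstress), for illustration only.
user_labels = [1, 1, 2, 2, 1, 2, 2, 2, 1, 2]
model_labels = [1, 2, 2, 2, 1, 2, 1, 2, 2, 2]
kappa = cohen_kappa(user_labels, model_labels)
print(f"kappa = {kappa:.2f}")
```

Because the chance-agreement term is subtracted, a classifier that labels almost every segment with the majority class can reach high raw agreement and still obtain a kappa close to zero.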
Regarding the ground truth dataset, SC segments with a valence lower than 5 and an arousal greater than 5 were labeled as stress, and the remaining SC segments as nonstress [39].
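The labeling rule above can be written as a small function. This is a sketch under the assumption that both ratings are integers on the 1 to 9 scale described earlier; the function name `label_segment` is hypothetical.

```python
def label_segment(valence: int, arousal: int) -> str:
    """Label a self-reported rating as 'stress' or 'nonstress'.

    Stress means valence below 5 and arousal above 5; everything else,
    including the midpoint value 5 on either scale, is nonstress.
    """
    for name, value in (("valence", valence), ("arousal", arousal)):
        if not 1 <= value <= 9:
            raise ValueError(f"{name} must be between 1 and 9, got {value}")
    return "stress" if valence < 5 and arousal > 5 else "nonstress"


# Hypothetical ratings, for illustration only.
print(label_segment(valence=3, arousal=7))  # low valence, high arousal
print(label_segment(valence=5, arousal=7))  # midpoint valence
```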
Next, the 113 SC segments were used as input to the trained classifiers. For each segment, each classifier returned a classification result (1 = stress, 2 = nonstress). The values returned by the stress models were compared with the participants' self-reported stress ratings that constituted our ground truth dataset. Table 3 presents the interrater reliability for each classifier. According to the levels of agreement presented in [49], the Q-SVM achieved a nonsignificant, slight agreement (Kappa = 0.17, p > 0.05, 95% CI [−0.01, 0.35]). The other two classifiers (C-SVM, L-SVM) returned Kappa values very close to zero, which means that there was no agreement beyond chance. The same procedure was followed with the deep learning model that was built and trained as described in Section 3.3. The proposed approach yielded clearly better results than the machine learning classifiers: the model achieved a Kappa value of 0.27 (p < 0.05), which is 58.8% higher than that of the Q-SVM classifier (Table 3). It should be noted that the deep learning model without categorical variables was used, to ensure a fair comparison with the SVM classifiers.
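The arithmetic behind the reported relative difference, and the mapping from a Kappa value to a verbal agreement level, can be checked with the short sketch below. It assumes that the levels in [49] follow the commonly cited Landis and Koch bands (slight up to 0.20, fair up to 0.40, and so on); this is an assumption, and the function names are hypothetical.

```python
def agreement_level(kappa: float) -> str:
    """Map a kappa value to a verbal level (assumed Landis and Koch bands)."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    for upper, level in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")):
        if kappa <= upper:
            return level
    return "almost perfect"  # not reached for valid input; kept for completeness


def relative_increase(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds a positive `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Kappa values reported in the text for the Q-SVM and the deep learning model.
q_svm_kappa, nn_kappa = 0.17, 0.27
print(agreement_level(q_svm_kappa), agreement_level(nn_kappa))
print(f"{relative_increase(nn_kappa, q_svm_kappa):.1f}% higher")
```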

Conclusions
A large number of publicly available physiological datasets have been recorded during stress research. The majority of them have been recorded in a context where users were exposed to intense stressors typically found in real-life conditions. Although such datasets can yield stress prediction models with rather high classification accuracy, it remains questionable whether these models can be used effectively to capture subtle stress responses, which are mostly expected in different contexts, such as UX evaluation studies [25].
In this paper we try to address the aforementioned question by conducting an in-depth investigation of the performance of such a dataset in the context of UX evaluation, combining traditional machine learning and deep-learning approaches. To the best of our knowledge, this is the first study that uses deep learning for stress detection in the UX context. More specifically, the WESAD dataset was used to train three popular machine learning classifiers and a neural network (NN). Regarding the binary classification problem (stress vs. nonstress), an accuracy of up to 97.4% was reached by the NN classifier.
We assessed the performance of the stress models by conducting an interrater reliability analysis using the Kappa coefficient. To this end, an existing biosignals dataset, consisting of SC segments, was used as the ground truth dataset. The SC segments of the ground truth dataset represent users' self-reported periods of usability issues confronted while they were interacting with a web-based platform during a UX evaluation.
With regard to the results, the highest interrater reliability was found for the NN model (Kappa = 0.27, p < 0.05). This level of agreement is quite comparable with the agreement level presented in [28], where the same ground truth dataset was used. There, the reported interrater reliability was found to be statistically significant and fair-to-moderate (Kappa = 0.35, p < 0.001). The higher agreement in [28] is probably explained by the fact that the stress assessment mechanism evaluated against the ground truth dataset was trained with biosignals that had been recorded while users performed typical HCI tasks.
In the present study we assessed the classifiers' performance by using only skin conductance. Such an approach aims to maximize practicality by reducing the number of sensors while maintaining accuracy at high levels. The present study serves as a first proof of concept by investigating whether a dataset emotionally annotated in a context where users have been exposed to intense stressors can indeed be used effectively in a different context (i.e., UX evaluation). In the next steps of our work, more participants and additional datasets will be included to further increase the objectivity and accuracy of the presented results. In addition, other approaches, such as the use of categorical variable features in traditional machine learning (e.g., SVM) or subject-dependent training, could create more efficient stress assessment mechanisms. Furthermore, combining biosignal datasets from various contexts is another challenge for future work.
Overall, the results presented in this paper reveal that the use of existing biosignal datasets in different contexts should be considered carefully. Although a one-size-fits-all approach is not suggested, this study provides interesting insights into the generalizability of biosignal datasets.
Author Contributions: Conceptualization, A.L. and E.F.; methodology, A.L. and E.F.; writing (original draft preparation), A.L. and E.F.; writing (review and editing), all authors; visualization, all authors; supervision, C.P.A., G.K., N.V. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.