Stress Analysis with Dimensions of Valence and Arousal in the Wild

In the field of stress recognition, the majority of research has conducted experiments on datasets collected from controlled environments with limited stressors. As these datasets cannot represent real-world scenarios, stress identification and analysis are difficult. There is a dire need for reliable, large datasets that are specifically acquired for the stress emotion with varying degrees of expression. In this paper, we introduce a dataset for Stress Analysis with Dimensions of Valence and Arousal in the Wild (SADVAW), which includes video clips with diverse facial expressions from different Korean movies. The SADVAW dataset contains continuous dimensions of valence and arousal. We present a detailed statistical analysis of the dataset and analyze the correlation between stress and the continuous dimensions. Moreover, using the SADVAW dataset, we trained a deep learning-based model for stress recognition.


Introduction
Stress is a normal phenomenon that is characterized by a feeling of emotional or physical tension. Recognition of multiple levels of stress is challenging due to the lack of easily accessible data with real-world context. Most of the currently available datasets are either acquired through a controlled lab environment or provide only a broad classification of different emotions.
According to the dimensional approach, affective states are not independent of each other. The circumplex model of affect [1] defines emotion along the arousal and valence dimensions, where arousal refers to the person's agitation level and valence refers to how pleasant or unpleasant the feeling is. Psychological evidence suggests that these two dimensions are intercorrelated [2][3][4]. Likewise, stress has both positive and negative interpretations based on its origin. Stress can arise from various events; for example, getting a new, demanding job can keep a person excited while in a stressed condition, whereas failing to achieve the desired result in an activity may induce negative stress. High arousal and negative valence are characteristics of emotional stress [5], an affective state induced by threatening stimuli. The adjectives typically used to describe emotional stress include "stressed", "nervous", and "tense", with antonyms being "relaxed", "calm", and "at ease" [6]. High arousal and negative valence are also characteristic of acute affective states: the specific emotions of anger, disgust, and fear [7]. Therefore, in the stress recognition process, identifying the position of an emotion in arousal-valence space is significant. Due to the crucial effects of stress on humans, there have been several studies related to stress analysis, such as stress assessment in the workplace [8,9] or using electronic equipment to detect stress [10,11]. The purpose of such studies is to understand stress better and build a computational mechanism to detect stressful situations. For example, assessing employees' stress levels through monitoring systems enables the adjustment of working conditions, thereby improving the performance of the employees [8,9].
To identify various levels of stress, we present a new publicly available dataset named Stress Analysis with Dimensions of Valence and Arousal in the Wild (SADVAW), which focuses on stress at 9 different levels of expression and on the representation of stress in valence-arousal space. In this paper, we present three main contributions:

• Firstly, we analyze the relationship between stress and the continuous dimensions of valence and arousal.

• Secondly, we present the SADVAW dataset, which is constructed to evaluate multiple stress levels in the continuous dimensions of valence and arousal based on facial features. The SADVAW dataset consists of 1236 video clips extracted from 41 Korean movies. The clips cover many characters with different backgrounds, closer to the real-world environment, and were evaluated on a 9-level scale for each of the stress, valence, and arousal classes.

• Thirdly, we develop a baseline model based on deep learning techniques for stress level detection. In particular, we first detect and extract human faces using the TinyFace [12] model. Then, we use ResNet [13] to extract a feature vector for each frame of the video clip. The sequence of features is fed to a Long Short-Term Memory (LSTM) network [14], followed by fully connected layers that predict the stress, arousal, and valence levels.

Based on the SADVAW dataset, we determined the correlation of stress with valence and arousal. Furthermore, we aim to use this analysis for stress detection systems operating on images/videos captured in real-world situations.
The remainder of this paper is organized as follows. Section 2 discusses related work on datasets for emotion recognition. Section 3 provides details of the dataset construction process and dataset statistics. Section 4 describes the baseline model, along with experiments and a discussion of the results. Finally, Section 5 presents the conclusions drawn from the study.

Existing Stress Datasets
In recent years, numerous datasets related to emotion recognition have been published; however, only a few of them focus on a specific emotion, such as stress. Among the stress-related datasets, the WeSAD dataset [15] has been widely explored in recent stress recognition studies. It consists of physiological and motion data collected from 15 subjects using wearable devices worn on the wrist and the chest. The ground truth was assigned based on the data collection protocol, in which the neutral, stressed, and amused states were induced using stressors. The dataset targets classification of the participants' neutral, stressed, and amused states; however, it does not provide a granular classification of stress at multiple levels.
The MuSE dataset [16] consists of recordings of 28 University of Michigan college students answering monologue questions and watching emotional videos. After responding to the questions, the participants were asked to rate themselves on two emotion dimensions: activation and valence. After watching each video clip, the participants marked an emotion category (angry, sad, etc.).
Similarly, the SWELL-KW dataset [17] is the result of experiments conducted on 25 subjects doing office work, such as writing reports, making presentations, reading e-mail, and searching for information. The collected data consists of the computer usage patterns, facial expressions, body postures, electrocardiogram (ECG) signal, and electrodermal activity (EDA) signal. All participants worked under all three conditions: no stress, time pressure, and interruptions.
However, the WeSAD [15], MuSE [16], and SWELL-KW [17] datasets were collected under controlled laboratory conditions with limited context and stressors, such as time pressure and interruptions [17] or emotion elicitation clips [15,16], whereas stress has many different levels and is expressed differently in each particular case. Therefore, we built a large video dataset to address these limitations. The dataset focuses on stress detection based on facial expressions, and the stress values were evaluated at multiple levels. In addition, the collected data contain diverse faces in various situations under the influence of many different factors. This removes the limitations of lab-collected datasets and makes the dataset closer to the real-world environment. A comparison with related datasets is summarized in Table 1. Furthermore, the models used to evaluate these datasets, while achieving high prediction accuracy, are inflexible and expensive to deploy in natural environments because they require multiple pieces of equipment to measure many parameters. In contrast, with the proposed dataset, a model only needs a camera to capture the user's face.

Table 1. Comparison of the SADVAW dataset with the existing stress datasets WeSAD [15], MuSE [16], and SWELL-KW [17].

LSTM Architecture
Long Short-Term Memory (LSTM), a type of recurrent neural network, was introduced by Hochreiter and Schmidhuber in 1997 [14]. This architecture is particularly well suited to sequential data such as video or audio. The crucial difference between LSTM and other recurrent neural networks is the memory cell, a cell state that stores long-term information, which allows the LSTM to remember and connect earlier parts of the data; it is described in detail in [18]. The LSTM unit structure is shown in Figure 1. Each LSTM unit comprises three gates: an input gate, a forget gate, and an output gate, each assigned a specific task. First, the input gate controls the amount of information flowing into the memory cell. Second, the forget gate determines which parts of the previous cell state should be discarded and which parts of the current input should be kept. Finally, the output gate controls the amount of information that flows to the rest of the network.
The LSTM transition is given by Equation (1):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right)\\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right)\\
\tilde{C}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right)\\
h_t &= o_t \circ \tanh\!\left(C_t\right)
\end{aligned}
\tag{1}
$$

where $x_t$ is the input at the current time $t$; $f_t$, $i_t$, $o_t$ are the vectors of the forget gate, input gate, and output gate, respectively; '$\circ$' denotes the Hadamard product; $W_f$, $W_i$, $W_c$, $W_o$ are the weight matrices of the input $x_t$, and $U_f$, $U_i$, $U_c$, $U_o$ are the weight matrices of the previous hidden state $h_{t-1}$ at the forget gate, input gate, cell state, and output gate, respectively; $b_f$, $b_i$, $b_c$, $b_o$ are the bias parameters; $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent functions, respectively; $\tilde{C}_t$ refers to the tanh output (the candidate cell state); $C_t$ and $C_{t-1}$ denote the current and previous cell states, respectively; and $h_t$ and $h_{t-1}$ denote the current and previous outputs, respectively.
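For concreteness, the following is a minimal sketch (ours, not the authors' code) of a single LSTM step implementing the equations above in PyTorch; the weight-matrix layout is an assumption for illustration:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Equation (1).

    W, U, b are dicts keyed by 'f', 'i', 'c', 'o' holding the input
    weights, recurrent weights, and biases of each gate (hypothetical layout).
    """
    f_t = torch.sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])    # forget gate
    i_t = torch.sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])    # input gate
    c_tilde = torch.tanh(x_t @ W['c'] + h_prev @ U['c'] + b['c'])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                              # Hadamard products
    o_t = torch.sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])    # output gate
    h_t = o_t * torch.tanh(c_t)                                     # new hidden state
    return h_t, c_t
```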

LSTM is widely used and yields good results in emotion recognition [19][20][21], video classification [22], and speech recognition [23,24].

Dataset Overview
This paper introduces a new stress recognition dataset, Stress Analysis with Dimensions of Valence and Arousal in the Wild (SADVAW) (the SADVAW dataset can be downloaded at https://www.kaggle.com/c/2020kerc/data, accessed on 13 April 2021), a collection of video clips of actors in movies. The levels of stress, valence, and arousal of the actors are evaluated based on their apparent emotional states. The SADVAW dataset contains 1236 video clips, where each clip focuses on only one person. The video clips were collected from 41 different Korean movies that represent diverse contexts and were extracted using the tool in [25]. To ensure the dataset's quality, clips in which the actors' faces were invisible or obstructed were removed; therefore, each video clip focuses on the visible face, and its expressions, of a person involved in different activities. The data cover people in many different environments, such as offices, homes, or the outdoors, and are intended for detection models that focus on facial expressions. This dataset uses the same videos as in [25], but the two datasets were created with different objectives: the dataset in [25] targets recognition of seven basic emotions in the videos, whereas the dataset in this work was built for stress recognition and affective dimensions with continuous values. Figure 2 shows examples of people annotated with different scores of stress, valence, and arousal in the SADVAW.

Annotators and Evaluation Process
A total of 27 college students (12 males and 15 females), aged 21 to 26, voluntarily participated in this study. They were asked to evaluate and annotate the video clips in the dataset on a nine-point scale for each class (stress, valence, and arousal). They had no history of mental or psychiatric ailments. The participation criteria also required candidates not to be under medication, even for simple illnesses such as the common cold, because health and mental issues can affect the perception of the characters' emotions in the video. The annotators were trained before carrying out the evaluation. They were instructed to judge the facial expressions for emotion labels immediately, as they felt them, and were told to avoid making a conscious effort to respond. They were also prohibited from any discussion with others to avoid bias or influence.

Annotation sessions were conducted for 3 h each in the morning and afternoon over 2 days. The evaluators were divided into two groups: the first group (14 people) evaluated on the morning of the first day and the afternoon of the second day, and the second group (13 people) evaluated on the afternoon of the first day and the morning of the second day. Annotators rated each video clip on a nine-point scale, where the range (1 to 9) represents the least positive to the most positive, the least agitated to the most agitated, and the least stressful to the most stressful, for valence, arousal, and stress, respectively (see Table 2). Figure 3 illustrates example frames from video clips in the SADVAW database, along with their corresponding annotations. At the top-left of Figure 3, the man is looking towards the woman with a cheerful, happy smile; according to the evaluated values, he was in a positive state, without stress, and had a low agitation level. At the bottom-left of Figure 3, the girl is wary of someone and shows slight nervousness, and was therefore evaluated as neutral in stress, valence, and arousal. Similarly, the people at the top-right and bottom-right of Figure 3 were evaluated as having high stress, a high agitation level, and negative valence.

Table 2. Description of the stress and dimensions of valence and arousal.

Class           Description
Stress (1-9)    How stressed does the person feel? (non-stress to stress)
Valence (1-9)   How positive or negative is their emotion? (negative to positive)
Arousal (1-9)   What is the agitation level of the person? (inactive to active)

We computed the mean, the median, and the mean of the values within two standard deviations (±2σ) [26] of the overall mean for each video clip, which 27 people evaluated. These values were similar or differed only slightly; we therefore chose the mean of the ratings within the ±2σ range as the ground truth.
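As a sketch, this ground-truth aggregation could be implemented as follows (a minimal illustration, assuming per-clip rating arrays; not the authors' exact script):

```python
import numpy as np

def aggregate_ratings(ratings):
    """Ground-truth score for one clip: mean of the annotator ratings
    that fall within two standard deviations of the overall mean."""
    ratings = np.asarray(ratings, dtype=float)
    mu, sigma = ratings.mean(), ratings.std()
    kept = ratings[np.abs(ratings - mu) <= 2 * sigma]  # drop outlier ratings
    return kept.mean()

# Toy usage: 27 hypothetical ratings for one clip
print(aggregate_ratings([6, 7, 6, 5, 7, 6, 6, 8, 1, 6, 7, 6, 5,
                         6, 7, 6, 6, 5, 7, 6, 9, 6, 7, 6, 5, 6, 7]))
```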


Annotation Agreement
To measure annotator agreement, we calculated the percentage of videos on which n annotators (max(n) = 27) agreed for each value of stress, valence, and arousal, as shown in Table 3a-c, respectively. The values in each table are highlighted from lowest to highest by color scales. The results show that, across all levels, 77.48% of videos have agreement from two or more annotators for stress, 78.6% for valence, and 78.94% for arousal. The highest percentage of videos with agreement from two or more annotators is at level four for valence (17.83%) and level six for stress (14.54%) and arousal (15.16%), while the lowest is at level nine for valence (1.09%) and level one for stress (2.64%) and arousal (1.95%). Visual inspection shows that some videos can plausibly be interpreted differently and thus labeled with different levels of stress, valence, and arousal. For example, Figure 4 shows the number of people who annotated a given video with various levels of stress, valence, and arousal.

Figure 5 shows the number of annotated videos for every value of stress, valence, and arousal. The number of videos is unevenly distributed across the range (1-9), which makes the dataset particularly challenging. In particular, Figure 5a shows more examples with stress values from 6 to 7.5 (stressed) than with stress values of 1 or 2 (unstressed); Figure 5b,c show the corresponding distributions for valence and arousal.
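One plausible way to tabulate such agreement is sketched below; this reflects our reading of Table 3 (modal rating per video and the number of annotators who gave it), not necessarily the authors' exact procedure:

```python
import numpy as np
from collections import Counter

def agreement_table(ratings_per_video, n_annotators=27, n_levels=9):
    """Entry [n, v-1] is the percentage of videos whose most common
    rating was level v, given by exactly n annotators."""
    table = np.zeros((n_annotators + 1, n_levels))
    for ratings in ratings_per_video:
        level, count = Counter(ratings).most_common(1)[0]  # modal rating and its support
        table[count, level - 1] += 1
    return 100.0 * table / len(ratings_per_video)

# Toy usage: three videos rated by five annotators each
demo = [[6, 6, 6, 5, 7], [4, 4, 5, 5, 3], [9, 8, 9, 9, 9]]
print(agreement_table(demo, n_annotators=5))
```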

Table 4 shows the correlation between stress and the valence and arousal dimensions, measured by Pearson's correlation coefficient (PCC). The PCC between stress and valence is negative (r = −0.96), indicating a negative correlation: as stress increases, valence tends to decrease, and vice versa (Figure 6a). Conversely, the PCC between stress and arousal is positive (r = 0.78), indicating a positive correlation: as stress increases, arousal tends to increase, and as stress decreases, arousal decreases (Figure 6b).

Table 4. Pearson's correlation coefficient between stress and the dimensions of valence and arousal.

          Valence    Arousal
Stress    −0.96      0.78

In addition, we performed a t-test [27] to compare valence between the stress (stress value ≥ 6.0) and non-stress (stress value < 6.0) groups. According to Table 5, the one-tail p < 0.05 indicates a significant difference in negative emotion between the stress and non-stress groups. The negative t-stat = −56.34 means that valence in the non-stress group is greater than in the stress group; in other words, non-stressed people show a more positive state than stressed people. We therefore conclude that the perceived stress level influences the perception of negative/positive emotion in facial expressions.

Similarly, the one-tail p < 0.05 for arousal between the stress (stress value ≥ 6.0) and non-stress (stress value < 6.0) groups shows a statistically significant difference. The positive t-stat = 11.18 indicates that the agitation level of stressed people is higher than that of non-stressed people (see Table 5 for details).
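Statistics of this kind can be reproduced with standard tools. The SciPy sketch below uses toy data (not the SADVAW annotations) shaped to mimic the reported correlations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
stress = rng.uniform(1, 9, 1236)                     # hypothetical per-clip stress scores
valence = 10 - stress + rng.normal(0, 0.6, 1236)     # toy data mimicking r close to -0.96
arousal = 0.8 * stress + rng.normal(0, 1.5, 1236)    # toy data mimicking r close to 0.78

r_sv, _ = stats.pearsonr(stress, valence)            # PCC, stress vs. valence
r_sa, _ = stats.pearsonr(stress, arousal)            # PCC, stress vs. arousal

# t-test on valence between the stress (>= 6.0) and non-stress (< 6.0) groups
is_stress = stress >= 6.0
t_stat, p_two = stats.ttest_ind(valence[is_stress], valence[~is_stress])
p_one = p_two / 2                                    # one-tailed p-value

print(f"r(stress, valence) = {r_sv:.2f}, r(stress, arousal) = {r_sa:.2f}")
print(f"valence t-test: t = {t_stat:.2f}, one-tail p = {p_one:.2e}")
```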

Experimental Setup
The dataset is divided into three sets: training (636 videos), validation (300 videos), and testing (300 videos). The experiment was performed in three steps. In the first step, we used the TinyFace [12] model to detect faces in the video frames; the numbers of detected faces in the training, validation, and test sets were 43,328, 20,314, and 20,924, respectively. In the second step, we used the ResNet50 model [13], pre-trained on the ImageNet [28] dataset with 1000 class labels, to extract features from input images of size 224 × 224 × 3. Finally, in the third step, we used two LSTMs followed by four fully connected layers to predict the stress, valence, and arousal values through a sigmoid function. Note that we normalize all classes to lie in the range (0-1). Moreover, we use mean squared error as the loss function and Adam as the optimizer. The model outputs three values: stress, valence, and arousal. Table 6 shows the network parameters, and the overall architecture of the baseline model is shown in Figure 7.
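A minimal PyTorch sketch of this baseline is shown below. The hidden size and fully-connected widths are our assumptions for illustration (the actual parameters are in Table 6), and we render the "two LSTMs" as a two-layer LSTM:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class StressBaseline(nn.Module):
    """ResNet50 frame features -> two-layer LSTM -> four FC layers -> sigmoid."""
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(                    # four fully connected layers
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 3),                         # stress, valence, arousal
        )

    def forward(self, frames):                        # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.features(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        seq_out, _ = self.lstm(feats.view(b, t, -1))
        return torch.sigmoid(self.head(seq_out[:, -1]))  # targets normalized to (0, 1)

model = StressBaseline()
criterion = nn.MSELoss()                              # mean squared error loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```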

We used an LSTM in the baseline model because it can process entire data sequences (such as speech or video). Moreover, it is designed to remember information over long periods, meaning that the output can be affected by inputs observed much earlier. This suits the task of detecting stress from facial expressions, because facial changes across the video frames all contribute to better recognition.

Evaluation Metrics
We evaluated the predictions and the relationships between the relevant variables using several metrics: mean squared error (MSE), mean relative error (MRE) [29], and Pearson's correlation coefficient. The mean squared error is an evaluation measure commonly used as a loss function for training regression models [30]. It measures the average squared difference between the predicted value and the ground-truth value [31,32], as shown in Equation (2). Let $Y$ be the ground truth and $\hat{Y}$ be the prediction. MSE is defined as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{2}$$

The mean relative error measures the ratio of the absolute error of a measurement to the measurement being taken. It is expressed as follows:

$$\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|Y_i - \hat{Y}_i\right|}{Y_i} \tag{3}$$

Pearson's correlation coefficient is the most commonly used correlation statistic for measuring the degree of relationship between two variables. It is calculated by Equation (4):

$$r = \frac{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)\left(\hat{Y}_i - \bar{\hat{Y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{Y}_i - \bar{\hat{Y}}\right)^2}} \tag{4}$$
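A compact NumPy sketch of these three metrics (a hypothetical helper; the MRE assumes positive targets):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MSE, MRE, and Pearson's r, following Equations (2)-(4)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)              # Equation (2)
    mre = np.mean(np.abs(y_true - y_pred) / y_true)    # Equation (3); assumes y_true > 0
    r = np.corrcoef(y_true, y_pred)[0, 1]              # Equation (4)
    return mse, mre, r

print(evaluate([6.2, 3.1, 7.5], [5.8, 3.6, 7.0]))
```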

Experimental Results
On average, the baseline obtained an MSE of 1.34, an MRE of 20.10, and a PCC of 0.50 on the test set. In addition, Figure 8 summarizes the results obtained for each example in the test set. Figure 8a shows the MSE of valence for the stress and non-stress samples, where the stress cases obtained a lower error. Figure 8b illustrates the results for arousal, where the non-stress samples achieved better performance than the stress cases. Finally, Figure 9 shows the baseline model's predictions: fairly accurate predictions in the images of the left column, and a large gap between ground-truth and predicted values (indicated in red) in the images of the right column.

Figure 9. Ground truth and results on clips for stress recognition (Sp: predicted stress; Vp: predicted valence; Ap: predicted arousal; Sg: ground-truth stress; Vg: ground-truth valence; Ag: ground-truth arousal).


Conclusions
Our analysis shows that stress correlates positively with arousal and negatively with valence. We also presented the SADVAW dataset, which contains video clips closer to real-world scenes, to address the problem of stress recognition in the wild. The video clips were evaluated and annotated on a 9-level scale for the stress, valence, and arousal classes. Moreover, we developed a baseline model based on deep learning techniques to verify that the levels of stress, valence, and arousal can be assessed on our dataset. The SADVAW dataset also contributes to clarifying the correlation between stress and the dimensions of valence and arousal.
Although data collected from movies represent real-world scenarios more accurately than data collected in controlled environments, certain limitations remain, such as the influence of the scripted scenario and the acting ability of the actors. Additionally, the facial expressions of movie characters are more intense than real-life expressions, because movies are designed to affect viewers as much as possible. However, using clips from popular movies with experienced, professionally trained actors minimizes such limitations.
In the future, we hope to increase the number of samples in the dataset and refine the quality of the selected samples. In particular, we are interested in collecting more samples of currently infrequent values, such as stress and arousal values of 1 or 2, and valence values of 8 or 9. We also intend to use the dataset for further research on real-world emotional expression.