Real-Time Psychological Stress Detection According to ECG Using Deep Learning

Today, excessive psychological stress has become a universal threat to humans. Repeated exposure to high stress can heavily affect work and study, and if the exposure lasts long enough it can even cause cardiovascular disease and cancer. Therefore, both monitoring and managing stress are imperative to reduce the harmful outcomes of excessive psychological stress. Conventional monitoring methods first extract features of the RR interval of an electrocardiogram (ECG) in the time domain and the frequency domain, and then use machine learning models such as SVM, random forest, and decision tree to distinguish the level of stress. The biggest limitation of these methods is that at least one minute of ECG data, together with other signals, is indispensable to ensure high accuracy, which greatly hinders real-time application of the models. To achieve real-time detection of stress with high accuracy, we proposed a framework based on deep learning technology, built from convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM). To evaluate the performance of this network, we also conducted experiments applying conventional methods. The data of 34 subjects were collected on a server platform created by the group at the Institute of Psychology of the Chinese Academy of Sciences and our group. The accuracy of the proposed framework reached 0.865 on three levels of stress using a 10 s ECG signal, a 0.228 improvement compared with conventional methods. Therefore, our proposed framework is more suitable for real-time applications.


Introduction
Economic development has led to fierce competition among people, who are now more prone to high psychological stress. Excessive psychological stress reduces work efficiency and affects relationships and transportation safety [1]. Long-term stress can even induce depression, addiction, and cardiovascular and cerebrovascular diseases [2,3]. Excessive emotional stress has become a major problem affecting human physical and mental health. EMA (ecological momentary assessment) [4] and JITAI (just-in-time adaptive interventions) [5] are two effective methods for dealing with the negative consequences of excessive psychological stress. However, both methods require real-time monitoring of psychological stress, and the lack of a method that can monitor emotional stress in real time has become the main problem.
Fortunately, there is an important relationship between psychological stress and the autonomic nervous system, i.e., the SNS (sympathetic nervous system) and the PNS (parasympathetic nervous system). Exposure to stress enhances SNS activity, which shortens the RR interval, increases the low-frequency energy of HRV (heart rate variability), increases the respiration rate, decreases the high-frequency energy of HRV, and more [6]. After stress, the PNS is activated and produces the opposite effect. Therefore, the features of the RR interval and other physiological parameters can be used to detect stress. As an electrocardiogram (ECG) signal contains RR interval information and is easy to obtain, a lot of research has been undertaken on judging psychological stress from the ECG signal. Traditionally, five-minute ECG data have been used to detect stress [7]. Although the accuracy of these studies is reliable, a 5 min window is too long for real-time monitoring. To promote real-time application, some researchers have successfully detected stress with high accuracy from one-minute ECG and respiration signals [8]. However, acquiring breathing signals requires wearing a bandage, which greatly reduces comfort. Furthermore, an interval of one minute is still not short enough to satisfy the requirements of real-time monitoring.
Deep learning technology, proposed in recent years, has greatly promoted the development of artificial intelligence, especially through the CNN (convolutional neural network) and the RNN (recurrent neural network). Some studies have used deep learning techniques to monitor psychological stress. Winata used the LSTM (long short-term memory) model with an attention mechanism to classify stress from spoken language, and the accuracy achieved was 0.741 [9]. Li Jin built models using a DNN (deep neural network) and a conditional random field to classify stress from information on adolescent social behavior [10]. Lin used a deep sparse neural network to monitor the stress level from cross-media microblog data [11]. Bosun Hwang [12] categorized stress using 10 s ECG data with CNN and LSTM, but into only two categories. Se-Hui Song [13] categorized stress into four classes with a DBN (deep belief network), using SBP (systolic blood pressure), DBP (diastolic blood pressure), sleep time, heart rate, and age; the accuracy only reached 0.66, and the inputs used (sleep time, SBP, and DBP) are not well suited to real-time application. Although these studies have proven their models to be effective, they did not consider the need for real-time application.
To address this issue, and considering the effectiveness of deep learning technology and the easy accessibility of ECG signals, we proposed a deep learning network based on ECG. For this network, we also built a psychological stress monitoring platform that integrates data collection and analysis functions. Using it, the ECG data of 34 people were collected and used for training and testing the models. To show the proposed network's performance, we also applied the conventional feature extraction method with nine machine learning methods (SVM, decision tree, random forest, and so on). In the current study, we divided mental stress into three levels, i.e., low, medium, and high. To the best of our knowledge, this is the first time that deep learning technology has been applied to detect three levels of psychological stress using only a 10 s ECG signal from a data set collected based on the Montreal model. The proposed framework with deep learning is more suitable for real-time applications.

Experiments and Data Acquisition
To support ECG signal acquisition, storage, management, and real-time analysis, we built an emotional stress-monitoring platform consisting of a server and front ends. After receiving the ECG data through the 4G network, the server could analyze the data and send the results back to the front end in a timely manner. We built the server on Flask and used MongoDB [14] for the database, Nginx [15] as a reverse proxy, and Gunicorn [16] for concurrent processing. The software Supervisor was also used to monitor the main program and the database, so that the system could restart automatically and keep running if it stopped because of a bug. The structure of the server is shown in Figure 1.
The front end consists of sticky devices and mobile apps. The sticky devices were created by our lab, using the ADS1292 chip for ECG signal collection and the CC2640 for data transmission, as shown in Figure 2. Compared with other devices, the two-electrode structure makes the device more comfortable to wear because of its small volume and light weight. In use, the sticky device was placed horizontally on the center of the chest, in the area of the fourth or fifth rib. The ECG, sampled at 250 Hz, was transmitted to the mobile app via Bluetooth 4.0, and finally the server received the data through the 4G network.
Common methods for stress inducement include color-word experiments, ice-water stimulation, public speaking, mathematical calculations, and watching horror videos. In view of the experimental feasibility and the controllability of the induced degree, we chose the Montreal stress model with calculations as the main method [17]. The Montreal Imaging Stress Model was originally designed by psychologists to assess psychological stress, and it contains three phases: rest, moderate stress, and high stress.
The experimental procedures were developed by the group from the Institute of Psychology, Chinese Academy of Sciences. The procedure contained three kinds of periods and lasted 15 min in total: the first five minutes were a rest phase, followed by light stress and heavy stress phases arranged alternately. The detailed process is shown in Figure 3. The VAS (visual analogue scale) report [18] of psychological stress was gathered from the subjects every 30 s as the ground truth for labeling the data.
No computing tasks were scheduled during the rest period. During that time, we played melodious music and showed humorous moving pictures to the subjects to make them as relaxed as possible. This design simulated the no-stress situations that the subjects encounter in daily life.
During the moderate stress phase, we improved the Montreal stress model by incorporating more evoking factors. Among these, we made the background color gray to give the participants a psychological hint of negativity. The subjects were asked to perform simple double-digit addition and subtraction. Each formula contained four variables, each ranging from −20 to 20, and the operators (−, +, −) were fixed. To cause moderate stress, a result prompt, the correct rate, and a 5 s time bar implying sufficient time were displayed below the formula. If the time ran out before the subject answered, the system judged the answer wrong and restarted the time bar. The tasks were therefore of moderate difficulty and easy to finish, and the time allowed was sufficient. This design simulated the moderate-stress situations that the subjects encounter when working or studying. The actual display is shown in Figure 4.
During the high stress phase, to induce as much psychological stress as possible, we added bounty missions, in which the participants received a reward of 1 yuan for each correct answer and a penalty of 1 yuan for each wrong answer. The system also displayed the amount of money in a large red font in the middle of the interface, attracting enough attention to create a sense of oppression in the subjects. The answer buttons were also shrunk and randomly arranged. Having worked out the answer, the subject still had to find the correct button in the remaining time, and during that search the stress would increase significantly. The actual display is shown in Figure 5.
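The arithmetic prompt described for the moderate stress phase can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_question`, the use of a seeded random generator, and the parenthesized display of negative operands are hypothetical and not taken from the authors' system.

```python
import random

OPERATORS = ("-", "+", "-")  # fixed operator order stated in the text


def make_question(rng):
    """Return (formula_text, result) for one prompt with four operands in [-20, 20]."""
    a, b, c, d = (rng.randint(-20, 20) for _ in range(4))
    result = a - b + c - d  # mirrors the fixed (-, +, -) operator order
    terms = [str(a)]
    for op, value in zip(OPERATORS, (b, c, d)):
        # Parenthesize negative operands so the displayed formula stays readable.
        shown = f"({value})" if value < 0 else str(value)
        terms.append(f"{op} {shown}")
    return " ".join(terms), result
```

Calling `make_question(random.Random(0))` yields one formula string and its expected result; a front end could compare the subject's answer against that result before the time bar expires.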
Appl. Sci. 2021, 11, 3838
Figure 5. The period of high stress: simple operation, bounty task, and zoomed-out buttons.
Different subjects have different computing abilities. A fixed difficulty would lack challenge for subjects with strong ability and would discourage participation by subjects with weak computing ability. To avoid these problems, the system integrated three adaptive functions: problem difficulty, button layout, and time left. When the answer accuracy was less than 40%, which implied that the subject's computing ability was weak, the system generated a correct answer between 0 and 9, fixed the button sequence, and set the answer time to 7 s. When the answer accuracy was between 40% and 60%, which implied normal ability, the system generated a correct answer between 0 and 12, randomized the button sequence, and set the answer time to 5 s. When the answer accuracy was larger than 60%, which implied strong ability, the system generated a correct answer between 0 and 15, randomized the button sequence, and set the answer time to 3 s.
In addition, the program judged the participant's state from another dimension. When 3 consecutive questions were answered correctly, the answer time became 2/3 of the current time; conversely, when 3 consecutive questions were answered wrong, the answer time became 4/3 of the current time. This design simulated the high-stress situations encountered in the subjects' work or study: even if the subject worked as well as possible, it was hard to reach the passing line, which was equivalent to 60% accuracy in this experiment. The adaptive algorithm is shown in Figure 6.
Figure 6. The adaptive function. RR is the number of consecutive correct answers, W number of consecutive wrong answers.
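The adaptive rules described above can be written as a short sketch. The thresholds, answer ranges, and time limits come from the text; the class and function names are hypothetical, and two details are assumptions because the text does not specify them: accuracy values of exactly 40% or 60% are treated as the normal tier, and the caller is expected to track and reset the streak counters.

```python
from dataclasses import dataclass


@dataclass
class Difficulty:
    max_answer: int       # the correct answer is drawn from 0..max_answer
    random_buttons: bool  # whether the answer buttons are shuffled
    time_limit_s: float   # time allowed per question, in seconds


def base_difficulty(accuracy):
    """Map the running answer accuracy (0..1) to the three tiers in the text."""
    if accuracy < 0.4:
        return Difficulty(9, False, 7.0)   # weak ability: fixed buttons, 7 s
    if accuracy <= 0.6:
        return Difficulty(12, True, 5.0)   # normal ability: random buttons, 5 s
    return Difficulty(15, True, 3.0)       # strong ability: random buttons, 3 s


def adjust_time(time_limit_s, correct_streak, wrong_streak):
    """Scale the time limit after three consecutive correct or wrong answers."""
    if correct_streak >= 3:
        return time_limit_s * 2 / 3
    if wrong_streak >= 3:
        return time_limit_s * 4 / 3
    return time_limit_s
```

For example, a subject at 50% accuracy starts from a 5 s limit, and three correct answers in a row would shorten a 6 s limit to 4 s under this sketch.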
A total of 34 people without cardiovascular or cerebrovascular diseases were recruited by poster from the University of Chinese Academy of Sciences to participate in the experiment. Among them were 20 males and 14 females. The subjects' ages ranged from 20 to 35, with an average of 23.4. It was confirmed that they had not participated in similar experiments before and had no history of smoking or drinking in the prior two days. We informed the subjects of the experimental process and gave a demonstration before they submitted their written informed consent. The ECG was recorded using the sticky device at a sampling rate of 250 Hz, a 10-point moving average filter was then used to remove burrs, and the data were uploaded to the server every 10 s. The actual test scene and the collected ECG waveform are shown in Figure 7. The research ensured that the rights and interests of the subjects were fully protected and that the research content would not cause harm or risk to the subjects. It was reviewed and approved by the Institutional Review Board of Beijing Tiantan Hospital, Capital Medical University.
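The preprocessing mentioned above (a 10-point moving average followed by 10 s uploads at 250 Hz, i.e., 2500 samples per segment) might look like the sketch below. The function names are hypothetical, and the handling of the first nine samples and of a trailing partial segment is an assumption, since the text does not describe these edge cases.

```python
FS_HZ = 250      # sampling rate stated in the text
WINDOW = 10      # 10-point moving average
SEGMENT_S = 10   # upload interval in seconds


def moving_average(samples, window=WINDOW):
    """Causal moving average; early outputs average over the samples seen so far."""
    out = []
    running = 0.0
    for i, x in enumerate(samples):
        running += x
        if i >= window:
            running -= samples[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def split_segments(samples, fs=FS_HZ, seconds=SEGMENT_S):
    """Cut the signal into non-overlapping segments, dropping a trailing partial one."""
    n = fs * seconds
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```

Each element returned by `split_segments` then corresponds to one 10 s window of the kind the classifier receives.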

Conventional Methods
To compare the conventional methods with the deep learning model proposed in this article, we also implemented the conventional methods. Their features were manually extracted from the time domain and frequency domain of the RR interval or HRV. HRV was defined as the change in the difference of successive RR intervals. To obtain the RR interval, the positions of the R-peaks were detected from the ECG signal using the Pan-Tompkins algorithm [19]. To reduce the impact of individual differences, this article used the z-score algorithm, Formula (1), to normalize the RR interval so that its mean becomes 0 and its variance becomes 1:
$$z_i = \frac{RR_i - \mu}{\sigma} \qquad (1)$$
where $\mu$ and $\sigma$ are the mean and the standard deviation of the RR intervals.
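As a minimal sketch of the normalization step described above, the following code converts R-peak positions to RR intervals and applies a z-score. It assumes the R-peak sample indices are already available (for example from a Pan-Tompkins detector, which is not implemented here); the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math


def rr_intervals(r_peaks, fs=250):
    """Convert R-peak sample indices to RR intervals in seconds."""
    return [(b - a) / fs for a, b in zip(r_peaks, r_peaks[1:])]


def z_score(values):
    """Normalize to zero mean and unit variance (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("z-score is undefined for a constant series")
    return [(v - mean) / std for v in values]
```

After this step the normalized series has mean 0 and variance 1, which is the property the text relies on to reduce differences between individuals.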
The time-domain features influenced by momentary ANS (autonomic nervous system) activity were extracted. These statistical features included mean-RR (the mean value of the RR interval), SD (the standard deviation of HRV), MED (the median value of HRV), QD (the quartile deviation of the RR interval), percent 20th (the 20th percentile of the RR interval), percent 80th (the 80th percentile of the RR interval), and the average heart rate. As research [20] has shown that the high-frequency energy of HRV is related to the activity of the parasympathetic nerve and the low-frequency energy to the activity of the sympathetic nerve, HRV was computed from the RR-interval time series and its frequency-domain features were extracted: HF (the high-frequency energy of the HRV signal), LF (the low-frequency energy of the HRV signal), and LF_HF (the ratio of LF power to HF power), as shown in Figure 8. Finally, the 10 features were used to train and test nine machine learning classifiers (XGBoost [21], logistic regression [22], RBF (radial basis function) SVM (support vector machine) [23], random forest, decision tree, linear SVM, K-nearest neighbors, AdaBoost, and naive Bayes), as Table 1 shows. The performance is reported in the results section.
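The seven time-domain features listed above could be computed roughly as in the sketch below. Several choices here are assumptions rather than the authors' definitions: HRV is taken as the successive RR differences, percentiles use linear interpolation, the quartile deviation is (Q3 − Q1)/2, and the average heart rate is 60 divided by the mean RR interval in seconds. The three frequency-domain features (HF, LF, LF_HF) need a power spectrum estimate and are left out of this sketch.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def time_domain_features(rr_s):
    """Compute the seven time-domain features from RR intervals in seconds."""
    hrv = [b - a for a, b in zip(rr_s, rr_s[1:])]  # successive RR differences
    rr_sorted = sorted(rr_s)
    return {
        "mean_rr": statistics.fmean(rr_s),
        "sd": statistics.pstdev(hrv),
        "med": statistics.median(hrv),
        "qd": (percentile(rr_sorted, 75) - percentile(rr_sorted, 25)) / 2,
        "percent_20th": percentile(rr_sorted, 20),
        "percent_80th": percentile(rr_sorted, 80),
        "avg_heart_rate": 60 / statistics.fmean(rr_s),
    }
```

Note that the heart rate only makes sense on raw RR intervals in seconds, so this sketch expects unnormalized input; which features the authors computed before or after z-scoring is not stated in the text.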

The Proposed Network
Most conventional machine learning methods use prior knowledge to extract features manually. Because prior knowledge is limited, conventional feature extraction methods inevitably ignore some non-linear relationships between ECG and stress. The purpose of this article was to avoid the limitations of prior knowledge by using deep learning technology.
With deep learning, we could extract features from the original ECG signal automatically and establish a mapping from the original ECG to psychological stress that meets the needs of real-time application (using less than 1 min of ECG to infer stress with high accuracy).
CNN is one of the most popular neural networks, and it has greatly promoted the development of image processing. Unlike RNN, it is characterized by local connections and weight sharing. With these, a CNN can automatically extract the spatial structural features of an image, which makes it good at recognizing data under displacement, scaling, and rotation [24]. In addition, the model's complexity is greatly reduced, so it is easier to optimize than a fully connected network. A typical CNN usually contains a convolution layer, a pooling layer, and a fully connected layer, as shown in Figure 9. The convolutional layer performs simplified convolution operations between the convolution kernel and the input data selected by the window function. Therefore, the convolution layer can extract the features of the input data that relate to the convolution kernel. As different convolution kernels extract different features, increasing the number of kernels within a certain range increases the number of extracted features, although it also increases the computing time. The pooling layer after convolution mainly performs down-sampling, which highlights the effective components of the features and reduces the amount of calculation. Taking into account the CNN's ability to extract local features, we used a CNN instead of manually extracting features via conventional methods, and we increased the number of convolution kernels to increase the number of features. The choice of CNN was based on the fact that it does not require feature engineering.
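To make the convolution and pooling steps concrete, here is a small pure-Python sketch on a 1-D signal. It is an illustration only: the kernels are fixed hypothetical values, whereas in a real CNN they are learned during training, and the layer sizes of the proposed network are not given in this section.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product (the simplified convolution used in CNNs)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def feature_maps(signal, kernels, pool_size):
    """One pooled feature map per kernel, so more kernels give more features."""
    return [max_pool1d(conv1d(signal, k), pool_size) for k in kernels]
```

With two kernels, `feature_maps` returns two pooled maps; adding kernels adds maps, which is the trade-off between feature count and computing time noted above.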
Traditional neural networks assume that all inputs are independent of one another. Therefore, they cannot take advantage of the sequential relationships within time series signals, such as language. RNN [25] is one of the most popular networks for such signals. It advances through time by a fixed step size and continuously generates a historical state; at each step, the historical information of the previous moment is also taken as part of the input, so the network can use the back-and-forth connections within time-series signals. It has shown excellent performance in the field of natural language processing. LSTM (as shown in Figure 10) is a kind of RNN which uses a forget gate and a memory gate to address the disadvantages of short-term memory.
Throughout the entire ECG time series, the state at the current moment is also affected by future moments, so the characteristics of the next moment can also be used to predict the current moment's status. BiLSTM (shown in Figure 11) is a widely used variant of LSTM [26]. It consists of two LSTM units that extract features from the forward and backward sequences separately and then jointly determine the output. There are generally three ways to combine the outputs of the two LSTM units: addition, multiplication and series connection (as shown in Figure 12). Because ECG is a time series signal and is therefore well suited to BiLSTM, a BiLSTM was used to further extract features over the whole time span. All three combination methods were tried, and the results are given in the Results section.
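As an illustration of the three ways of combining the two LSTM outputs, the following sketch merges per-time-step output vectors in plain Python. It assumes that "series connection" means concatenation and that the backward outputs have already been re-aligned to forward time order; the function name and the toy values are hypothetical.

```python
def merge_bilstm(forward, backward, mode):
    """Combine time-aligned outputs of the forward and backward LSTM units.

    forward, backward: lists (one entry per time step) of equal-length vectors.
    mode: "add", "multiply" or "concat" (assumed reading of "series connection").
    """
    if mode == "add":
        return [[f + b for f, b in zip(fv, bv)] for fv, bv in zip(forward, backward)]
    if mode == "multiply":
        return [[f * b for f, b in zip(fv, bv)] for fv, bv in zip(forward, backward)]
    if mode == "concat":
        return [fv + bv for fv, bv in zip(forward, backward)]
    raise ValueError(f"unknown merge mode: {mode}")


# Two time steps, two hidden units per direction (toy values).
forward = [[1.0, 2.0], [3.0, 4.0]]
backward = [[0.5, 0.5], [2.0, 1.0]]

added = merge_bilstm(forward, backward, "add")
multiplied = merge_bilstm(forward, backward, "multiply")
concatenated = merge_bilstm(forward, backward, "concat")
```

Addition and multiplication keep the output width equal to the number of hidden units, while concatenation doubles it, which also doubles the input size of the following fully connected layer.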
Figure 10. The structure of the long short-term memory (LSTM) unit: f is the forget gate and m is the memory gate. One operator symbol in the figure represents element-wise multiplication and the other represents element-wise addition.
Appl. Sci. 2021, 11, x FOR PEER REVIEW
Figure 12. The three kinds of BiLSTM output, produced by the three ways of joining the outputs of the two LSTM units.
The proposed network is shown in Figure 13. First, one-dimensional convolution was used to extract local temporal features. Within the CNN, LRN (local response normalization) was used to refine the classification boundary of the model and highlight the contrast between features. The mechanism of LRN [23] mimics the lateral inhibition phenomenon in neuroscience, in which activated neurons suppress nearby neurons. This generates competition in the network, so that effective features are strengthened and ineffective features are diminished during training.
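The lateral inhibition behind LRN can be sketched as follows. This is the commonly used cross-channel form, in which each activation is divided by a power of the summed squares of its neighbouring channels; the radius, bias, alpha and beta values are illustrative assumptions, since the hyperparameters used in this study are not stated.

```python
def local_response_norm(channels, radius=1, bias=1.0, alpha=1.0, beta=0.5):
    """Normalize each channel by the squared activity of its neighbours.

    channels: activations of all feature channels at one time step.
    radius, bias, alpha, beta: assumed hyperparameters (not from the paper).
    """
    n = len(channels)
    out = []
    for i, a in enumerate(channels):
        lo, hi = max(0, i - radius), min(n - 1, i + radius)
        sq_sum = sum(channels[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (bias + alpha * sq_sum) ** beta)
    return out


# The same weak response (1.0) next to a silent neighbour and next to a strong one.
quiet_neighbour = local_response_norm([1.0, 0.0, 0.0])
strong_neighbour = local_response_norm([1.0, 5.0, 0.0])
```

In this toy case the first channel is scaled to 1/sqrt(2) when its neighbour is silent but to 1/sqrt(27) when its neighbour is strongly active, which is the competition effect described above.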
In the activation layer, ReLU (rectified linear unit) was used for the non-linear transformation of the model. As the generalization ability and the computational complexity are determined by the parameters of the BiLSTM and the CNN, we determined the optimal parameters through repeated experiments. After the BiLSTM layers, a fully connected layer was used to map the extracted features to the three levels of psychological stress. At the end, the softmax layer calculated the probability of each stress level for the current sample. For training, cross-entropy was selected as the loss function. To maximize the generalization ability of the model, the L1 regularization term [27] of the fully connected layer's weights was also added to the total loss, which increases the sparsity of the model.
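The loss described above (softmax, cross-entropy, and an L1 penalty on the fully connected weights) can be written out as a small sketch. The regularization strength `lam`, the logits and the weight matrix are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])


def l1_penalty(weights, lam):
    """lam times the sum of absolute weights; pushes small weights toward zero."""
    return lam * sum(abs(w) for row in weights for w in row)


def total_loss(logits, true_class, fc_weights, lam=1e-3):
    # lam is an assumed regularization strength, not a value from the paper.
    return cross_entropy(softmax(logits), true_class) + l1_penalty(fc_weights, lam)


logits = [2.0, 1.0, 0.1]           # scores for low, moderate, high stress (toy)
fc_weights = [[1.0, -2.0], [0.5, 0.0]]
probs = softmax(logits)
loss = total_loss(logits, true_class=0, fc_weights=fc_weights)
```

Because the penalty grows with the absolute value of every weight, minimizing the total loss tends to drive unneeded weights to exactly zero, which is the sparsity effect mentioned in the text.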
Before the training, the initial learning rate was set to 0.6, the batch size to 30 and the number of epochs to 3000. With this setting, the model randomly selects 30 samples for training each time, 3000 times in total. During training, the network updated its parameter values through an adaptive stochastic gradient descent algorithm, Adadelta, with which the learning rate is adjusted adaptively for better training results. Dropout was used to randomly select a subset of the nodes to participate in each training step, so that the model received more diverse training. The proposed method is described in further detail in the 'Results' section.
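To show how Adadelta adapts the step size, here is a minimal single-parameter sketch of its update rule. The decay rate `rho` and the constant `eps` are assumed defaults, the learning rate is applied as a multiplier on the Adadelta step (a convention used by some framework implementations), and the quadratic objective is only a stand-in for the network loss.

```python
import math


def adadelta_minimize(grad_fn, x0, steps, lr=0.6, rho=0.95, eps=1e-6):
    """Minimize a 1-D function with Adadelta, given its gradient function."""
    x = x0
    avg_sq_grad = 0.0    # running average of squared gradients
    avg_sq_delta = 0.0   # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size is the ratio of the two running RMS values.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += lr * delta
    return x


def grad(x):
    # Gradient of the stand-in objective f(x) = x^2.
    return 2 * x


x_after_one = adadelta_minimize(grad, x0=1.0, steps=1)
x_final = adadelta_minimize(grad, x0=1.0, steps=3000)
```

The first steps are very small because the running average of past updates starts at zero; the effective step then grows while gradients stay consistent and shrinks again near the minimum.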
We evaluated our deep neural network on 3098 ECG signals of 10 s each, divided randomly into a training set (80% of the data set) and a testing set (20%). Each deep learning model was evaluated by five-fold cross-validation and the results were averaged. The experiments were performed on a computer with an Intel Core i7 processor, 16 GB RAM and TensorFlow 1.14.0.
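The evaluation protocol can be sketched as a five-fold split in which each fold serves once as the 20% test set. This sketch assumes the split is made at the level of individual 10 s windows; the function name and the seed are hypothetical.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test


splits = list(five_fold_indices(3098))
test_sizes = [len(test) for _, test in splits]
```

With 3098 samples, each test fold holds 619 or 620 windows and the remaining windows form the training set; the reported score would then be the mean of the five per-fold results.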

Results
To achieve real-time stress level detection, we designed a deep neural network, implemented conventional methods that rely on hand-crafted features, and compared their performance.
To implement the conventional methods, we preprocessed the ECG signal and extracted 10 features from the time and frequency domains, as shown in Figure 8. Nine machine learning methods were then used to detect the stress level. The models were evaluated with window sizes of 1 min and 10 s. The results for the 1 min window are shown in Table 2: the nine algorithms obtained an average accuracy of 0.647, an average recall of 0.557, and an average specificity of 0.810. With the 10 s window, the performance of these models decreased. As shown in Table 3, the average accuracy dropped to 0.563, the average recall to 0.457 and the average specificity to 0.731. The reason is that frequency-domain feature extraction is less effective for a shorter window.
Since the goal of this article is to build a model for real-time application, the proposed framework and the conventional methods are compared and analyzed for the 10 s window.
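For reference, accuracy, recall and specificity for a three-class problem can be derived from a confusion matrix as in the sketch below. The matrix holds made-up counts, not results from this study, and treating each class one-vs-rest is an assumption, since the averaging scheme is not stated.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    recalls, specificities = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp     # other classes predicted as c
        tn = total - tp - fn - fp
        recalls.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))
    return accuracy, recalls, specificities


# Rows: true low / moderate / high stress (hypothetical counts).
cm = [[8, 2, 0],
      [1, 8, 1],
      [0, 4, 6]]
accuracy, recalls, specificities = per_class_metrics(cm)
mean_recall = sum(recalls) / len(recalls)
mean_specificity = sum(specificities) / len(specificities)
```

In this toy matrix the moderate class absorbs errors from both neighbours, so its specificity (0.70) is well below that of the other two classes (0.95 each).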
Among the conventional models, XGBoost obtained the highest accuracy of 0.637, the highest recall of 0.492, and the highest specificity of 0.765. The confusion matrix of XGBoost is shown in Figure 14. It can be seen from the confusion matrix that XGBoost achieved a recall of 0.916 for moderate stress, but produced a poor recognition rate for low stress. It also misclassified most high stress states as moderate stress. In this experiment, it was especially difficult for the conventional models to correctly detect both low stress and high stress. The ROC curve of XGBoost (Figure 15) also indicated that its ability to recognize high stress was lower than its ability to recognize moderate and low stress.
To determine the optimal parameters of the proposed network, we conducted various experiments. Initially, we preset n_inputs to 5 and the stride of the pooling layer to 8, and used stitching (series connection) to combine the outputs of the BiLSTM. Because the ECG signal is quasi-periodic, we set the length of the convolutional filter to 200, which corresponds to the average RR interval (0.8 s) at the sampling rate of 250 Hz.
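The filter length follows directly from the RR interval and the sampling rate, as the short calculation below shows. The helper name is hypothetical, and the count of filter positions assumes an unpadded convolution, which the text does not specify.

```python
def samples_for(duration_s, sampling_rate_hz):
    """Number of samples covering a duration at a given sampling rate."""
    return round(duration_s * sampling_rate_hz)


kernel_length = samples_for(0.8, 250)    # one average RR interval -> 200 samples
window_length = samples_for(10, 250)     # one 10 s ECG window -> 2500 samples

# Assuming no padding, the filter can be placed at this many positions.
valid_positions = window_length - kernel_length + 1   # 2301

# A 10 s window spans this many average beats.
beats_per_window = window_length / kernel_length      # 12.5
```

A filter of 200 samples therefore spans roughly one heartbeat, so each filter response summarizes one beat-sized segment of the 10 s input.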
The performance of our initial network is shown in Figure 16 (left); the accuracy with 4 conv filters was only 0.819. We then tried to increase the number of features by increasing the number of convolution kernels. Figure 16 shows that the accuracy rose as the number of filters increased, reaching a maximum of 0.863 at 32 filters. When the number of filters was increased further, the accuracy decreased instead. This implies that beyond 32 filters no more valid features could be extracted, and the excess filters even caused overfitting, which reduced the accuracy of the model on the test set. We therefore used 32 as the number of conv filters.
With the number of conv filters fixed at its optimal value, further experiments were conducted, and a similar phenomenon was observed for the BiLSTM size. As shown in Figure 17, the accuracy did not increase at a size of 64 compared with a size of 32, whereas the training time increased greatly. Considering both accuracy and training time, 32 BiLSTM units were used for the proposed network.
The size of the pooling window and the n_input of the BiLSTM were chosen to maximize accuracy. As shown in Figure 18, the proposed network achieved the highest accuracy when the pool length was set to 8. This setting means that the pooling layer downsamples the features generated by the convolutional layer using a window of size 8. As shown in Figure 19, the proposed network achieved the highest accuracy when n_input was set to 50. This setting means that the BiLSTM further extracts more abstract features from every 50 initial features extracted by the CNN.
To reduce training time without affecting the performance of the proposed network, we also attempted to remove redundant features by adjusting the stride of the pooling layer. At the beginning, we set the stride to 2, which means that only half of the features extracted by the conv layer are passed to the BiLSTM layer. As seen in Figure 20, the accuracy of the model did not begin to decline until the stride reached 8, which implies that a stride of 8 filters out the most redundant features.
Finally, for the outputs of the BiLSTM, we tried the three combination methods using the previously obtained optimal parameters. As shown in Figure 21, addition, which was adopted by the proposed network, obtained the highest accuracy.
The proposed network obtained an accuracy of 0.865 and a specificity of 0.928. Compared with the best conventional method (XGBoost), our proposed network improved accuracy by 0.228. The ROC curves of XGBoost and the proposed network are shown in Figures 15 and 22, respectively. The area under the curve (AUC) of the proposed network was 0.88 for moderate stress, an improvement of 0.1 over XGBoost, and 0.85 for high stress, an improvement of 0.17. The confusion matrix of the proposed network is shown in Figure 23: the recall was 0.913 for low stress, 0.894 for moderate stress, and 0.798 for high stress. It can thus be concluded that the proposed network significantly improves the classification of stress into three levels compared with the conventional methods.

Discussion and Conclusions
This paper proposes a model for psychological stress detection using deep learning technology, and determines the optimal parameters of that model through various experiments. For comparison, we also implemented the conventional methods. The results showed that, using 10 s of ECG, our proposed network obtained a significant improvement over the conventional methods in detecting both moderate and high stress. This finding implies that our model is better suited to real-time application. The improvement is mainly due to the use of CNN and BiLSTM. CNN's excellent ability to handle local features and BiLSTM's excellent ability to handle long-term sequence signals enable our network to extract features automatically, overcoming the limitation of conventional methods, which rely on prior knowledge to extract key features.
Individual differences, especially personality differences, are important factors affecting emotional stress. In building the proposed network, we did not take the personality differences of the subjects into account. Exploring how to use personality differences to improve the performance of the proposed network will be our future work.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author. As the data were generated during this study, we did not find an appropriate platform to share the data.

Conflicts of Interest: The authors declare no conflict of interest.