#### 3.1. Subjects

We recruited 18 subjects (8 females and 10 males; age 24.6 ± 4.6 years) from Pohang University of Science and Technology (POSTECH), South Korea, via open recruitment. The subjects were selected based on the following criteria: (1) they had no cardiovascular disease or mental health problems, (2) they had not undertaken intense exercise before the day of the test, and (3) they had not consumed caffeinated beverages on the day of the test. This study was approved by the POSTECH Ethics Committee (PIRB-2019-E001).

The experiments were conducted in parallel with the recruiting process. Although most experiments proceeded normally, unexpected technical problems (namely a subject’s carelessness and an unexpected Windows OS update) occurred during two experiments, meaning that these two subjects’ data were not captured correctly. Thus, we only considered the datasets collected from the remaining 16 subjects.

#### 3.3. Machine Learning Approaches

To compare our deep learning approach with conventional machine learning approaches, we also developed several machine learning models for use as benchmarks. Here, we selected ECG and RESP features that have been used in many previous studies [11,12,17,18,19].

We extracted 11 handcrafted features from the ECG data, comprising four time-domain features and seven frequency-domain features (Table 1). As time-domain features, we extracted the mean HR (HR mean), the standard deviation of the Normal-to-Normal (NN) intervals (sdNN), the root mean square of successive differences of R peak-to-R peak (RR) intervals (rmssd), and the percentage of differences between adjacent RR intervals greater than 50 ms (pNN50). As frequency-domain features, we extracted the NN interval powers in the following ranges: 0.00–0.04 Hz (VLF), 0.04–0.15 Hz (LF), 0.15–0.40 Hz (HF), and 0.00–0.40 Hz (TF, the total frequency power). In addition, we included the ratios of LF to LF+HF (nLF), HF to LF+HF (nHF), and LF to HF (LF2HF) as frequency-domain features.
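As an illustrative sketch (not the authors' code), the time-domain HRV features named above can be computed directly from a sequence of RR intervals:

```python
# Illustrative sketch: the time-domain HRV features described above,
# computed from a list of RR intervals given in milliseconds.
import numpy as np

def hrv_time_features(rr_ms):
    """Return (HR mean, sdNN, rmssd, pNN50) for RR intervals in ms."""
    rr = np.asarray(rr_ms, dtype=float)
    diffs = np.diff(rr)                            # successive RR differences
    hr_mean = 60000.0 / rr.mean()                  # mean heart rate, beats/min
    sdnn = rr.std(ddof=1)                          # SD of NN intervals
    rmssd = np.sqrt(np.mean(diffs ** 2))           # RMS of successive differences
    pnn50 = 100.0 * np.mean(np.abs(diffs) > 50.0)  # % of |diff| > 50 ms
    return hr_mean, sdnn, rmssd, pnn50
```

For example, for RR intervals of [800, 810, 790, 870, 800] ms, two of the four successive differences exceed 50 ms, giving a pNN50 of 50%.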

We also extracted a total of eight handcrafted RESP features: three time-domain features and five frequency-domain features (Table 1). As time-domain features, we used the square root of the mean squared RESP (RMS), the interquartile range (IQR), and the mean difference between adjacent elements of each RESP segment (MDA). As frequency-domain features, we used the powers in the 0.00–1.00 Hz (LF1), 1.00–2.00 Hz (LF2), 2.00–3.00 Hz (HF1), and 3.00–4.00 Hz (HF2) ranges, as well as the ratio of LF1 + LF2 to HF1 + HF2 (L2H). As with the ECG frequency-domain features, the RESP frequency-domain features were computed using Welch's method of estimating the data's power spectral density.
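A minimal sketch of the band-power computation described above, using Welch's method via SciPy and assuming the 25 Hz RESP sampling rate stated later in the text (the `nperseg` choice is an assumption):

```python
# Illustrative sketch: RESP band powers via Welch's PSD estimate,
# assuming a 25 Hz sampling rate. nperseg is an assumed default.
import numpy as np
from scipy.signal import welch

def resp_band_powers(resp, fs=25.0):
    """Return the LF1, LF2, HF1, HF2 band powers and the L2H ratio."""
    freqs, psd = welch(resp, fs=fs, nperseg=min(256, len(resp)))
    def band_power(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return np.trapz(psd[mask], freqs[mask])  # integrate PSD over the band
    lf1, lf2 = band_power(0.0, 1.0), band_power(1.0, 2.0)
    hf1, hf2 = band_power(2.0, 3.0), band_power(3.0, 4.0)
    l2h = (lf1 + lf2) / (hf1 + hf2)
    return lf1, lf2, hf1, hf2, l2h
```

A pure 0.3 Hz sinusoid (a plausible resting respiration rate of 18 breaths/min) should place nearly all of its power in LF1.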

Then, we developed several machine learning models that have previously been proposed for classifying stress states [20]. While the models were being trained and evaluated, the features were normalized using a MinMax scaler to bring them into the 0–1 range. To prevent data leakage, the scaler's parameters were fitted using only the training set features and then used to normalize both the training and test set features. We tuned the models' hyper-parameters via grid search and calculated their average performance using five-fold cross validation.
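One leakage-free way to combine these steps is to place the scaler inside a scikit-learn pipeline, so that each cross-validation fold fits the scaler on its training split only. This is a sketch under assumptions, not the authors' implementation; the SVM classifier, its parameter grid, and the synthetic data are illustrative:

```python
# Illustrative sketch: MinMax scaling inside a pipeline (fitted per training
# split, so no leakage), grid search, and five-fold cross validation.
# The SVC classifier, its grid, and the toy data are assumptions.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 19))          # 11 ECG + 8 RESP features per segment
y = (X[:, 0] > 0).astype(int)          # toy stressed/unstressed labels

pipe = Pipeline([("scale", MinMaxScaler()), ("clf", SVC())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because the `MinMaxScaler` is a pipeline step, `GridSearchCV` refits it on each fold's training portion and applies it unchanged to the held-out portion, matching the leakage-prevention procedure described above.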

#### 3.4. Deep Learning Approaches

Unlike machine learning approaches, deep learning approaches are based on deep neural networks that can extract features directly from the data, rather than relying on well-defined handcrafted features. As the name implies, deep neural networks are artificial neural networks with two or more hidden layers. Having many hidden layers enables such networks to learn more complex nonlinear patterns and hierarchical information than shallow networks can. Despite these advantages, however, deep neural networks usually have a large number of parameters, which can lead to over-fitting, and they can suffer from vanishing gradients when they have a large number of layers. These problems can result in a failure to learn and an increase in generalization error. Recent algorithmic advances (e.g., rectified linear units, batch normalization, dropout, stochastic gradient descent, and data augmentation), more powerful computational hardware (e.g., general-purpose graphics processing units), and innovative network architectures, such as CNNs and LSTMs, have partially resolved these over-fitting and vanishing-gradient problems, enabling high performance to be achieved. These developments have encouraged the use of deep learning approaches in numerous fields, including physiological signal analysis [21] and stress recognition [5,12,15].

We designed our proposed network based on Deep ECG Net's structure [12]. First, a batch-normalization layer normalizes each physiological signal, so that the network learns to normalize the signals from the data itself. Then, there is a 1D convolutional layer and a 1D max-pooling layer for each signal, which extract stress-related waveform patterns from the ECG and RESP data; a rectified linear unit (ReLU) is used as the activation function. Next comes another 1D convolutional layer; no additional max-pooling layer is needed here because the previous max-pooling step has already greatly reduced the dimensionality. After that, multiple LSTM layers capture sequential information in the features extracted by the previous convolutional layer. Next, we concatenate the extracted ECG and RESP features and add a dense layer. Finally, a fully-connected layer with a sigmoid activation function classifies the data as stressed or unstressed. To prevent over-fitting, we also add dropout and batch-normalization layers.

Figure 3 shows the structure of the proposed DeepER (ECG–RESP) Net.
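The dual-branch structure described above can be sketched in Keras roughly as follows. This is a minimal illustration, not the authors' released code: the 20 s input window lengths, the exact dropout and batch-normalization placement, and the omission of the weight-decay regularizers are assumptions; the filter counts, kernel/pooling lengths, and LSTM/dense sizes follow the text.

```python
# Minimal sketch of a DeepER-Net-style dual-branch network, assuming 20 s
# input windows (ECG: 1 kHz -> 20,000 samples; RESP: 25 Hz -> 500 samples).
# Layer sizes follow the text; dropout/batch-norm placement is assumed.
from tensorflow.keras import layers, models

def branch(inp, kernel1, pool, kernel2):
    x = layers.BatchNormalization()(inp)  # learn signal normalization from data
    x = layers.Conv1D(50, kernel1, strides=1, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool)(x)      # greatly reduces dimensionality
    x = layers.Conv1D(50, kernel2, strides=1, padding="same", activation="relu")(x)
    x = layers.LSTM(32, return_sequences=True)(x)
    x = layers.LSTM(16)(x)
    return x

ecg_in = layers.Input(shape=(20000, 1))   # 0.6 s kernel, 0.8 s pooling at 1 kHz
resp_in = layers.Input(shape=(500, 1))    # 5 s kernel and pooling at 25 Hz
merged = layers.concatenate([branch(ecg_in, 600, 800, 25),
                             branch(resp_in, 125, 125, 4)])
merged = layers.Dropout(0.5)(merged)
merged = layers.Dense(512, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)  # stressed vs. unstressed
model = models.Model([ecg_in, resp_in], out)
```

Note that after max-pooling, the ECG branch has 25 time steps (20,000 / 800) and the RESP branch has 4 (500 / 125), which is why the second convolutional layers use kernel lengths of 25 and 4.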

As noted by the developers of Deep ECG Net [12], both the first 1D convolutional layer's kernel length and the 1D max-pooling layer's pooling length are important factors. They determined that a kernel length of 0.6 s (i.e., 600 points at a sampling frequency of 1 kHz) and a pooling length of 0.8 s (i.e., 800 points) were optimal. These choices are very plausible. First, the duration of the ECG's PQRST complex, i.e., the sum of its PR and QT intervals, is between 0.57 and 0.67 s [12], so selecting a kernel length within this range is reasonable. Furthermore, for the max-pooling operation to cover an interval containing at least one R peak (which is related to HR and HRV), the average heart-beat period (about 0.8 s) is a reasonable candidate. Based on these heuristic choices, we used the same kernel and max-pooling lengths (0.6 s and 0.8 s, respectively) in our first 1D convolutional layer for processing the ECG data. The kernel and max-pooling lengths of the network that processes the RESP data were chosen similarly: a single respiration period was used for both. Because the RESP pattern is simple, consisting of an expiration (nadir) and an inspiration (peak), this size is sufficient to extract the RESP features. Because adults normally respire 12–20 times per minute [22], we set both lengths to 5 s (i.e., 125 points at a sampling frequency of 25 Hz).

Our proposed network has 50 filters in each of the initial 1D convolutional layers, each with a stride of 1. In the ECG network, the second 1D convolutional layer has 50 filters, a kernel length of 25, and a stride of 1; in the RESP network, the second 1D convolutional layer has 50 filters, a kernel length of 4, and a stride of 1. These kernel lengths (25 and 4) were chosen so that both second layers focus on the same time interval (20 s). Zero-padding was used in all the convolutional layers to maintain the input size. There are 32 and 16 units in the first and second LSTM layers, respectively, and 512 units in the dense layer. All dropout layers have a dropout rate of 0.5, and the weight decay's regularization strength is ${10}^{-4}$.
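The "same time interval" claim follows from the pooling: after max-pooling, each time step of the second convolutional layer spans one pooling window, so the covered durations can be checked with simple arithmetic:

```python
# Sanity check (illustrative): after max-pooling, each time step spans one
# pooling window, so the second conv layers' kernels cover equal durations.
ecg_pool_s, resp_pool_s = 0.8, 5.0   # first-layer pooling lengths in seconds
ecg_kernel2, resp_kernel2 = 25, 4    # second-conv kernel lengths in time steps
assert ecg_kernel2 * ecg_pool_s == resp_kernel2 * resp_pool_s == 20.0  # seconds
```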

For training, we used the Adam optimizer [23] with a learning rate of ${10}^{-3}$ and a step decay scheduler (i.e., the learning rate is halved every 50 epochs). The binary cross-entropy loss was used to calculate the losses between the labels and predictions, as follows:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right],$$

where $N$ is the number of samples, $y_i$ is the label of the $i$-th sample (1 for stressed, 0 for unstressed), and $\hat{y}_i$ is the predicted probability of the stressed state.

We used a total of 250 epochs, a batch size of 32, and a 0.3 validation split (i.e., 30% of the training set). Finally, the model with the lowest loss on the validation set after 250 epochs was used for evaluation. As with the machine learning models, we used five-fold cross validation to evaluate the performance of the network.
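The step-decay schedule described above (base rate ${10}^{-3}$, halved every 50 epochs) can be expressed as a simple function; this is an illustrative sketch, not the authors' code:

```python
# Illustrative sketch of the step-decay schedule described above:
# a base learning rate of 1e-3, halved every 50 epochs.
def step_decay(epoch, base_lr=1e-3, drop=0.5, epochs_per_drop=50):
    """Learning rate for a given (0-indexed) epoch."""
    return base_lr * drop ** (epoch // epochs_per_drop)

# In Keras, such a function can be wrapped in a LearningRateScheduler
# callback and passed to model.fit(), alongside a ModelCheckpoint with
# save_best_only=True monitoring the validation loss, which matches the
# "model with the lowest validation loss" selection described above.
```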

All training processes were conducted using the well-known Keras deep learning library, with Python 2.7 running under Ubuntu 16.04.5, on a PC with a 3.6 GHz Intel Core i7 processor, 128 GB of RAM, and four NVIDIA (Santa Clara, CA, USA) GeForce GTX 1080 Ti GPUs.

#### 3.5. Metrics

Because this is a binary classification problem (i.e., the subject is either stressed or unstressed), we used the following metrics to evaluate both the deep learning network and the machine learning models:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $\text{Precision} = TP/(TP + FP)$ and $\text{Recall} = TP/(TP + FN)$.

Here, TP (true positive) is the number of cases correctly classified as “stressed,” while TN (true negative) is the number of cases correctly classified as “unstressed.” Likewise, FP (false positive) is the number of cases that were classified as “stressed” but were actually “unstressed,” while FN (false negative) is the number of cases that were classified as “unstressed” but were actually “stressed.” The first metric (accuracy) is the percentage of cases that were correctly predicted, while the second (F1 score) is the harmonic mean of the precision and recall, which indicates the trade-off between these two metrics.
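These definitions translate directly into code; the following is an illustrative sketch computing both metrics from the confusion-matrix counts defined above:

```python
# Illustrative sketch: accuracy and F1 score from the confusion-matrix
# counts (TP, TN, FP, FN) defined above.
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were correctly predicted."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (TN is not used)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note that the F1 score ignores true negatives, which is why it complements accuracy when the two classes are imbalanced.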

In addition to the accuracy and F1 score, we also used the area under the receiver operating characteristic (ROC) curve to evaluate the models. The area under the ROC curve (AUC) is a well-known model accuracy metric [24]. Because the ROC curve is constructed by computing the sensitivity and specificity at every probability threshold between 0 and 1, the AUC does not depend on any single threshold choice; it is therefore a reliable metric that reflects a model's average performance across thresholds. Models with AUCs above 0.9 are considered to be accurate [24].
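As a small illustrative example (not the authors' code), the AUC can be computed from predicted stress probabilities with scikit-learn; the toy labels and probabilities below are assumptions:

```python
# Illustrative sketch: AUC from predicted "stressed" probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # 1 = stressed, 0 = unstressed
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of "stressed"
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
```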