Deep Learning Models for Stress Analysis in University Students: A Sudoku-Based Study

Due to the phenomenon of “involution” in China, the current generation of college and university students is experiencing escalating levels of stress, both academically and within their families. Extensive research has shown a strong correlation between heightened stress levels and a decline in overall well-being. Therefore, monitoring students’ stress levels is crucial for improving their well-being in educational institutions and at home. Previous studies have primarily focused on recognizing emotions and detecting stress using physiological signals such as ECG and EEG. However, these studies often relied on video clips to induce various emotional states, which may not be suitable for university students who already face additional stress to excel academically. In this study, a series of experiments was conducted to evaluate students’ stress levels by engaging them in playing Sudoku games under different distracting conditions. The collected physiological signals, including PPG, ECG, and EEG, were analyzed using enhanced models such as LRCN and self-supervised CNN to assess stress levels. The outcomes were compared with participants’ self-reported stress levels after the experiments. The findings demonstrate that the enhanced models presented in this study assess stress levels with high proficiency. When subjects solved Sudoku puzzles accompanied by noisy or discordant audio, the models achieved an accuracy of 95.13% and an F1-score of 93.72%. When subjects solved Sudoku puzzles with another individual monitoring the process, the models achieved an accuracy of 97.76% and an F1-score of 96.67%. Finally, under comforting conditions, the models achieved an accuracy of 98.78% with an F1-score of 95.39%.


Introduction
The current generation of college and university students encounters a multitude of stressors, particularly in China, where the concept of "involution" gained prevalence around 2020. Academic-related stress has emerged as a significant source of anxiety for Chinese students, encompassing pressures to attain top grades, worries about receiving lower scores, concerns regarding the availability of internship opportunities compared to their peers, and anxieties about securing graduate study prospects.
Involution, also known as "Nei Juan" in Chinese pinyin, refers to situations in work and study where individuals put in extra effort, but that effort does not yield a proportional outcome [1]. For instance, the process of securing a graduate study opportunity in a university has become more challenging. In the past, students with relatively high grade point averages (GPAs) would be admitted based on limited quotas. However, admission offices now consider additional factors such as extracurricular activities and published academic papers. Consequently, students are compelled to invest extra effort in internships, extracurricular activities, and academic research to enhance their chances of admission.

Related Work
Wagh et al. [6] employed Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Decision Tree (DT) algorithms to classify positive, neutral, and negative emotions using time and time-frequency domain features extracted from various channels of electroencephalogram (EEG) data. The classifiers were trained on the SJTU emotion EEG dataset (SEED), resulting in an accuracy of 72.46% for DT and 60.19% for kNN. Vijayakumar et al. [7] developed a 1D convolutional neural network (CNN) to classify arousal, valence, and liking based on peripheral physiological signals, including blood volume pressure (BVP), horizontal electrooculogram (hEOG), vertical electrooculogram (vEOG), trapezius electromyogram (tEMG), zygomaticus electromyogram (zEMG), respiration rate (RSP), and skin temperature (SKT) data. The CNN model achieved accuracies of 77.03% for arousal, 68.75% for valence, and 74.68% for liking using time and frequency domain features. Miao et al. [8] proposed a parallel spatial-temporal 3D deep residual learning framework called MFBPST-3D-DRLF for emotion recognition using EEG signals. This framework utilized multiple frequency bands (delta, theta, alpha, beta, gamma) of EEG signals to generate a 3D representation of features, which were then trained using a 3D deep residual CNN model. It achieved a classification accuracy of 96.67% on the SEED dataset (positive, neutral, and negative emotions) and 88.21% on the SEED-IV dataset (happy, fear, sad, and neutral).
Montero Quispe et al. [9] employed a novel self-supervised learning approach for emotion recognition, consisting of two stages: self-supervised pre-training and emotion recognition model training. In the pre-training stage, the model learned to recognize six signal variants generated by applying noise, scaling, negation, flipping, permuting, and time warping to the original data. This approach aimed to capture the main characteristics of the data. Results showed that self-supervised learning outperformed fully supervised learning methods in classifying arousal and valence using EEG and electrocardiogram (ECG) data. Tang et al. [10] conducted experiments on emotion recognition using EEG data with a proposed model called Spatial-Temporal Information Learning Network (STILN), which achieved an accuracy of 68.31% for arousal and 67.52% for valence. Choi et al. [11] proposed an attention-LRCN model that reduced motion artifacts in collected photoplethysmography (PPG) data. By combining the attention module with a baseline model and utilizing frequency domain features of PPG as inputs, the proposed model achieved an accuracy of 97.11%. Li et al. [12] developed a 1D convolutional neural network and a multi-layer perceptron to enhance the accuracy of stress detection using the WESAD dataset. They achieved exceptional results, with an accuracy rate of 99.80% for binary classification and an accuracy rate of 99.55% for 3-class classification. Arsalan et al. [13] focused on detecting stress during public speaking activities by collecting EEG, Galvanic Skin Response (GSR), and PPG data. They extracted frequency domain and time domain features and employed support vector machines (SVMs) with radial basis function. Their approach achieved a stress classification accuracy of 96.25% and an impressive F1-score of 95.99%. Han et al. [14] gathered and extracted features from ECG, PPG, and GSR, achieving an accuracy of 94.55% using 10-fold cross-validation and 81.82% in a real-world setting with kNN. Rastgoo et al. [15] developed a stress detection system that dynamically monitors driver stress during driving. They incorporated multiple data sources such as ECG, steering wheel, and weather conditions to predict the driver's stress level. Their approach, combining CNN and long short-term memory (LSTM) networks, achieved an accuracy of 92.80% and an F1-score of 94.56%.
In general, biosignals such as PPG, ECG, and EEG signals have demonstrated their utility in analyzing both physical health status and mental states, including emotions, concentration, and stress. Several public datasets, such as AMIGOS [12], DREAMER [13], SWELL (Smart Reasoning for Well-being at Home and at Work) [14], SEED [15], DEAP [16], and WESAD [17], have been curated specifically for research purposes and contain carefully designed experiments to induce stress in participants while collecting these biosignals. EEG, ECG, and GSR signals were obtained from sets of videos in the AMIGOS dataset. Similarly, the DREAMER dataset and SEED dataset also employed film clips to evoke positive, neutral, and negative emotions, gathering participants' ECG and EEG signals. The DEAP dataset utilized music videos to elicit emotions across different arousal-valence quadrants, capturing EEG and peripheral signals. The WESAD dataset utilized video clips and the Trier Social Stress Test as stimuli, collecting peripheral signals from participants. In the SWELL dataset, participants were exposed to a range of stressors, including time pressure, within a genuine office environment. They were requested to engage in activities such as report writing and giving presentations while their ECG data were systematically gathered. These publicly available datasets have been adapted for a wide range of research focusing on emotion recognition and stress detection. Meanwhile, deep learning methods, including CNN and LSTM, yielded higher accuracies compared to traditional machine learning methods such as SVM and DT.
The majority of public datasets currently available for developing stress detection algorithms utilize video and audio stimuli. The SWELL project published a stress dataset collected under real-life office scenarios; however, no existing stress dataset collects data in a school context. Students frequently study and engage in challenging problem-solving tasks, producing physiological signal changes that differ greatly from those observed during passive video viewing. To address this gap, this study devised a stress induction protocol that combined different levels of problem-solving (Sudoku games at medium and hard difficulty levels) with environmental stimuli such as videos and audio, creating a context in which students were required to solve problems amidst various distractions.

Experiment Design
In order to replicate a stress-inducing environment, participants were tasked with completing multiple Sudoku games under different scenarios, including both noisy and noise-free conditions. Throughout the experiments, various physiological data, including ECG, PPG, and EEG, were collected from the participants using wearable sensors. Following each Sudoku game, participants were asked to assess their own stress levels by completing a questionnaire. All participants were healthy students from the campus who volunteered for the experiment by filling out a registration form posted on the university's forum. The registration form included a series of questions, such as their medical history related to heart diseases, favorite music, and music that caused discomfort. Students with heart-related conditions were excluded from participating in the experiment. The registration form was created using Wenjuanxing [18], a platform that facilitates the design of questionnaires, exams, voting systems, and rating forms.
A total of 30 participants, consisting of 6 males and 24 females, with a mean age of 20.4, were recruited for the study. Among the participants, 7 were from the Faculty of Science and Engineering (FOSE), 12 were from the Nottingham University Business School (NUBS), and 11 were from the Faculty of Humanity and Social Science (FHSS). In terms of academic status, there were 15 sophomore students, 4 junior students, 8 senior students, and 3 Ph.D. students, as shown in Table 1. Upon registration, participants were informed of the experiment's location and schedule based on their availability. Each experimental session involved only one participant and took place in a small meeting room equipped with one desk, a couple of chairs, and a television with a functioning speaker. The participants were given a brief introduction to the purpose of the experiment and were asked to sign a consent form and an information sheet if they agreed to participate before the experiment began. As the experiment involved solving Sudoku puzzles, the subjects were briefed on the basic rules of Sudoku. An iPad device running a Sudoku app that generated puzzles with the highest difficulty level was used. Following this, the subjects were fitted with sensors to collect physiological data. The experiment commenced only after confirming that the data receiver terminal was functioning correctly.
For each experiment, the participants were tasked with solving three Sudoku puzzles within a time limit of 15 min. The scenarios for each experiment were as follows:
• Scenario 1: The participant was left alone in the room, solving Sudoku puzzles while being exposed to noisy or discordant audio, such as white noise, and watching horror videos, such as zombie movies.
• Scenario 2: No music or videos were played during this scenario. Instead, a person was present in the room observing the participant while they solved the Sudoku puzzles.
• Scenario 3: The participant was left alone in the room, solving Sudoku puzzles while being exposed to comforting audio and videos, such as sounds of birds, waterfalls, and rainfall.
The participants were divided into two study groups. The first group consisted of 15 participants who were assigned to solve Sudoku puzzles with a "medium" difficulty level. The second group also had 15 participants, but they were assigned to solve Sudoku puzzles with a "hard" difficulty level. This division was made to investigate the impact of Sudoku difficulty on changes in stress levels. At the end of each trial, participants were asked to indicate their stress levels using a questionnaire designed with Wenjuanxing, adapted and revised from the self-report approach proposed by Li et al. [15]. The stress levels were assessed on a 3-point scale, where scores of 0 to 4 indicated relaxation or little stress, scores of 5 to 7 indicated medium stress, and a score of 8 indicated high stress. To enhance the level of stress and motivation during the Sudoku puzzle completion, incentives in the form of prizes were provided to the participants. The participants had the opportunity to win a prize for each Sudoku puzzle if they were able to successfully complete it within the given time limit of 15 min. Additionally, if a participant successfully completed all the Sudoku puzzles within the designated time frame, they were eligible to receive an additional prize. In total, there were four prizes available to be won.

Data Collection
The data collection process involved using three separate healthcare devices to gather PPG, ECG, and EEG data from the participants, as shown in Figure 1. PPG data were collected using the Polar Verity Sense (PVS) [19], ECG data were collected using the BMD101 device [20], and EEG data were collected using NeuroSky's MindWave Mobile 2 (MV2) [21]. The sampling rates for data collection were set at 55 Hz for PVS, 512 Hz for BMD101, and 512 Hz for MV2. It is important to note that data collection did not occur during the participant's self-reporting periods.
To capture the signals emitted by the BMD101, PVS, and MV2 devices, three distinct programs were employed. The BMD101 package [20] offered a starter program designed specifically for acquiring the ECG signal transmitted by the BMD101. Additionally, PVS provided an SDK example for capturing the PPG signal [19]. To collect the EEG signals transmitted by MV2, a ThinkGear socket [21] developed with Node.js on GitHub was utilized. The program for receiving ECG and EEG signals was executed on Windows 11, while the program for capturing the PPG signal ran on an Android platform. All of these programs recorded the raw data in text files.

Data Pre-Processing
The raw data obtained from the experiment had varying sampling rates, which posed challenges when inputting the data into the classification models without preprocessing. Additionally, there were certain inevitable recording errors during the data collection process. For example, the start time of PPG data collection may have been slightly delayed or advanced compared to the other two types of data, resulting in a few missing data points. Furthermore, there were uncontrollable variables in the experiment, such as instances where participants completed the Sudoku puzzle in less than 15 min, requiring the experiment to end earlier. To address these issues, all the data were resampled to a uniform rate of 256 Hz, resulting in a total of 230,400 data points for each 15 min experiment. Subsequently, a Butterworth band-pass filter was applied to remove noise from the data. For PPG data, a low-cut frequency of 0.5 Hz and a high-cut frequency of 5 Hz were used; the raw PPG1, PPG2, and PPG3 signals corresponded to the sensor's green, red, and infrared light sources, respectively. For ECG data, the low-cut and high-cut frequencies were set at 5 Hz and 15 Hz, respectively, and for EEG data, a low-cut frequency of 0.1 Hz and a high-cut frequency of 15 Hz were employed. For the segmentation step, a sliding window with a duration of 10 s and no overlap was applied to the 15 min data, dividing the 230,400 data points into 90 segments of 2560 data points each. Figure 2 depicts the unprocessed and filtered PPG, ECG, and EEG data. The training pipeline for a model is outlined in Figure 3, and Figure 4 illustrates the processing steps applied to a collected sample of ECG data.
Figure 2. Collected (a) raw PPG, (b) filtered PPG, (c) raw ECG, (d) filtered ECG, (e) raw EEG, and (f) filtered EEG data in this study. The PPG1, PPG2, and PPG3 signals represent the green, red, and infrared light captured by the sensor, respectively. Following the application of filtering techniques, both motion-related noise in PPG and noise in ECG were effectively reduced. However, there were no discernible changes observed in the EEG signal.
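The preprocessing pipeline described above can be sketched with SciPy. The sampling rates, cutoff frequencies, and window length follow the text; the Butterworth filter order (4) is an assumption, as the paper does not state it.

```python
# Sketch of the preprocessing pipeline: resample to 256 Hz, apply a
# Butterworth band-pass filter, and segment into 10 s non-overlapping
# windows. Filter order is an assumption (the paper does not state it).
import numpy as np
from scipy.signal import resample, butter, sosfiltfilt

FS_TARGET = 256          # uniform sampling rate (Hz)
DURATION_S = 15 * 60     # one experiment: 15 min
WINDOW_S = 10            # sliding-window length, no overlap

def preprocess(raw, fs_raw, low_hz, high_hz, order=4):
    """Resample -> band-pass filter -> segment into 10 s windows."""
    # Resample the whole recording to 256 Hz (230,400 points for 15 min).
    n_target = int(len(raw) * FS_TARGET / fs_raw)
    x = resample(raw, n_target)
    # Zero-phase Butterworth band-pass (e.g., 0.5-5 Hz for PPG);
    # second-order sections keep the filter numerically stable.
    sos = butter(order, [low_hz, high_hz], btype="band",
                 fs=FS_TARGET, output="sos")
    x = sosfiltfilt(sos, x)
    # Non-overlapping 10 s windows: 90 segments of 2560 points each.
    win = WINDOW_S * FS_TARGET
    n_win = len(x) // win
    return x[: n_win * win].reshape(n_win, win)

# Example: 15 min of synthetic PPG recorded at 55 Hz (Polar Verity Sense).
raw_ppg = np.random.randn(55 * DURATION_S)
segments = preprocess(raw_ppg, fs_raw=55, low_hz=0.5, high_hz=5.0)
print(segments.shape)  # (90, 2560)
```

The same function applies to ECG (5–15 Hz) and EEG (0.1–15 Hz) by changing the cutoff arguments.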

StressNeXt
In the StressNeXt model [22], the parameter quantity is initially reduced using a 1 × 1 convolutional block. The reduced parameters are then passed through four consecutive multi-kernel blocks. Each multi-kernel block comprises multiple convolutional layers with different kernel sizes, namely 1 × 1, 1 × 3, 1 × 5, and 1 × 7. Within the multi-kernel block, the input signal is simultaneously processed by these convolutional layers, and instead of concatenating the results at the end, as in the Inception network [23], the feature maps obtained from these layers are combined by addition. The specific architecture of the StressNeXt model is depicted in Figure 5.

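One multi-kernel block as described above can be sketched in PyTorch: the input is processed in parallel by 1 × 1, 1 × 3, 1 × 5, and 1 × 7 convolutions, and the resulting feature maps are summed rather than concatenated. Channel counts here are illustrative assumptions; see [22] for the exact configuration.

```python
# Minimal sketch of a StressNeXt-style multi-kernel block: parallel 1D
# convolutions whose outputs are combined by addition (not concatenation).
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # "Same" padding keeps every branch's output the same length,
        # which is what makes element-wise addition possible.
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7)
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Sum the parallel feature maps instead of concatenating them.
        return self.act(sum(branch(x) for branch in self.branches))

# One batch of 10 s segments: 4 samples, single channel, 2560 points.
x = torch.randn(4, 1, 2560)
block = MultiKernelBlock(in_ch=1, out_ch=32)
y = block(x)
print(y.shape)  # torch.Size([4, 32, 2560])
```

Addition keeps the output channel count fixed at `out_ch`, whereas Inception-style concatenation would multiply it by the number of branches.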

LRCN
The attention-LRCN (Long-term Recurrent Convolutional Network) model [14] was originally designed to take the Short-Term Fourier Transform (STFT) of signals as input, which consists of 2 dimensions corresponding to frequency and time. However, in this study, only 1 dimension is used as input. Therefore, the baseline of the attention-LRCN was taken and adapted to perform classification tasks on 1D input signals from the dataset. The model begins with two 1D convolutional layers and a max pooling layer to reduce computation complexity. Next, two residual blocks are applied, each separated by a max pooling layer that further reduces the feature maps. After the residual blocks, two LSTM layers are used to extract time-domain features from the signals and perform the classification task. Figure 6 shows the architecture of the LRCN model.

Figure 6. The LRCN architecture consists of 1D convolutional layers followed by ReLU activation.
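The adapted 1D LRCN described above can be sketched as follows: two 1D convolutions with max pooling, two residual blocks separated by a pooling layer, then two stacked LSTM layers feeding a classifier. Channel widths, kernel sizes, and the 3-class output are illustrative assumptions; only the overall layout follows the text.

```python
# Sketch of the adapted 1D LRCN (layout per the text; sizes assumed).
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        # Identity shortcut around two convolutions.
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class LRCN(nn.Module):
    def __init__(self, n_classes=3, ch=32, hidden=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, ch, 7, padding=3), nn.ReLU(),
            nn.Conv1d(ch, ch, 7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            ResBlock1d(ch),
            nn.MaxPool1d(4),     # pooling between the residual blocks
            ResBlock1d(ch),
        )
        # Two stacked LSTM layers consume the (time, channel) features.
        self.lstm = nn.LSTM(ch, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        f = self.features(x)               # (batch, ch, time)
        out, _ = self.lstm(f.transpose(1, 2))
        return self.head(out[:, -1])       # classify from the last step

logits = LRCN()(torch.randn(2, 1, 2560))
print(logits.shape)  # torch.Size([2, 3])
```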

Self-Supervised CNN
The process of self-supervised learning consists of two stages as illustrated in Figure 7. In the first stage, known as pretraining, a self-supervised CNN [9] is trained to recognize whether an input is a transformed version of the original signals. Various transformations are applied to the signals, including adding noise, scaling, negating, horizontally flipping, and permuting the signal. However, in this study, the time-warped transformation was excluded. Before applying these transformations, the data are segmented and normalized, as shown in Figure 4. For example, Gaussian noise with parameters µ = 0 and σ = 0.01 is added to the signal, the signal is scaled by a factor of 1.1, and signal permutation involves randomly selecting 20 pieces of length 1/4 of the sampling rate (64 data points) and swapping them to create a permuted version of the original signal.
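The signal transformations used above for pretraining can be sketched with NumPy, using the parameters given in the text (Gaussian noise with µ = 0 and σ = 0.01, scaling by a factor of 1.1, and 20 swapped pieces of 64 points each for permutation).

```python
# Signal transformations for self-supervised pretraining (per the text).
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, sigma=0.01):
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, factor=1.1):
    return x * factor

def negate(x):
    return -x

def flip(x):
    return x[::-1]

def permute(x, n_pieces=20, piece_len=64):
    # Pick 20 starting points at random, then swap the pieces pairwise
    # (possible overlaps between pieces are ignored in this sketch).
    y = x.copy()
    starts = rng.choice(len(x) - piece_len, size=n_pieces, replace=False)
    for a, b in zip(starts[0::2], starts[1::2]):
        tmp = y[a:a + piece_len].copy()
        y[a:a + piece_len] = y[b:b + piece_len]
        y[b:b + piece_len] = tmp
    return y

segment = rng.standard_normal(2560)   # one normalized 10 s segment
variants = [f(segment) for f in (add_noise, scale, negate, flip, permute)]
print(all(v.shape == segment.shape for v in variants))  # True
```

Each transform preserves the segment length, so originals and variants can be fed to the same network during the pretext classification task.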

The shared layer of the self-supervised CNN is composed of three convolutional blocks. Each block consists of two 1D convolutional layers and is followed by a max pooling operation. In the first convolutional block, there are 32 filters with a filter size of 32. The second block has 64 filters with a size of 16, and the third block uses 128 filters with a size of 8. The max pooling layers in each block have a size of 8 and a stride of 2. After passing through the shared layer, the feature maps have dimensions of 128 × 605. These feature maps are then inputted into a global max pooling layer, resulting in an output size of 128 × 1.
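The shared encoder described above can be sketched in PyTorch: three blocks of two 1D convolutions plus max pooling (32/64/128 filters of size 32/16/8, pooling size 8 with stride 2), followed by global max pooling down to a 128-dimensional embedding. Padding choices are assumptions, so the intermediate temporal length may differ from the 605 quoted in the text.

```python
# Sketch of the shared encoder of the self-supervised CNN.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # Two 1D convolutions followed by max pooling (size 8, stride 2).
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, k), nn.ReLU(),
        nn.Conv1d(out_ch, out_ch, k), nn.ReLU(),
        nn.MaxPool1d(kernel_size=8, stride=2),
    )

shared = nn.Sequential(
    conv_block(1, 32, 32),      # block 1: 32 filters, size 32
    conv_block(32, 64, 16),     # block 2: 64 filters, size 16
    conv_block(64, 128, 8),     # block 3: 128 filters, size 8
    nn.AdaptiveMaxPool1d(1),    # global max pooling -> 128 x 1
    nn.Flatten(),
)

emb = shared(torch.randn(4, 1, 2560))   # batch of 10 s segments
print(emb.shape)  # torch.Size([4, 128])
```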

Training Parameters
The models' performance was evaluated using k-fold stratified cross-validation. This approach ensures that each split contains all class labels rather than having one fold with labels from only one class. The value of k was set to 3 for this training. The optimizer used was Adam, with a learning rate of 0.001. During the training of the models, the total number of epochs was set to 300, while for the self-supervised CNN pretraining stage, it was set to 150 epochs. The validation accuracy and F1-score of each epoch were averaged and returned as the performance for each fold. After completing the cross-validation, the performance of each fold was averaged again to obtain the final model performance.
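The stratified 3-fold split described above can be sketched with scikit-learn (an assumption here, as the paper trains in PyTorch): every fold's validation split contains all class labels. The label array below is synthetic.

```python
# Stratified 3-fold cross-validation: each validation fold contains all
# stress classes, never labels from only one class.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(90, 2560)      # 90 segments per experiment
y = np.repeat([0, 1, 2], 30)       # 3 stress classes (synthetic labels)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_classes = [set(y[val_idx]) for _, val_idx in skf.split(X, y)]
print(all(c == {0, 1, 2} for c in fold_classes))  # True
```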
The models were developed using the PyTorch framework and executed on an HP Shadow Elf 7 laptop running Windows 11. The laptop was equipped with an NVIDIA GeForce RTX 3060 Laptop GPU and an Intel(R) Core(TM) i7-11800H CPU.


Results and Discussion
Various signal combinations, such as PPG, ECG, EEG, PPG + ECG, PPG + EEG, ECG + EEG, and PPG + ECG + EEG, were explored to determine the most effective signals for stress detection. The performance of the models was evaluated using accuracy and F1-score as metrics. Accuracy represents the percentage of correctly classified data in relation to the entire dataset and serves as a straightforward measure of performance:

Accuracy = (1/N) × Σ_{i=1}^{N} (p_i == y_i)

where N is the number of samples, p_i is the predicted label of the i-th sample, and y_i is the true label of the i-th sample; the term (p_i == y_i) evaluates to 1 if p_i is the same as y_i and to 0 otherwise. The F1-score can be interpreted as the harmonic mean of precision and recall, where precision assesses the classifier's capacity to avoid mislabeling negative samples as positive, while recall evaluates the classifier's ability to identify all positive samples:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
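The metrics defined above can be computed from scratch so the formulas are explicit. The example below uses binary labels; the study's multi-class F1 would average the per-class scores in the same way.

```python
# Accuracy and F1-score computed directly from the definitions.
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of samples where the predicted label equals the true label.
    return np.mean(y_true == y_pred)

def f1_score(y_true, y_pred, positive=1):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)   # avoid mislabeling negatives as positive
    recall = tp / (tp + fn)      # find all positive samples
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
print(accuracy(y_true, y_pred))   # 0.666... (4 of 6 correct)
print(f1_score(y_true, y_pred))   # 0.75 (precision = recall = 0.75)
```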

Scenario-Based Self-Reporting Stress Analysis
Figures 9 and 10 presented the changes in stress levels across different scenarios while participants solved "medium" and "hard" levels of Sudoku puzzles, respectively. The results indicated that a majority of participants experienced an increase in stress levels in scenario 1, which involved noisy and horror audio/videos. About half of the participants (13 out of 30) reported higher stress levels in scenario 2, where they were monitored by another person. Conversely, in scenario 3, which included comforting audio and video, most participants (22 out of 30) reported a decrease in stress levels. When comparing scenario 3 to scenario 1, 20 out of 30 participants demonstrated a decrease in stress levels in the final scenario. Additionally, in scenario 1, participants solving the "hard" Sudoku puzzles tended to experience higher levels of stress compared to those solving the "medium" Sudoku puzzles. Similarly, in scenario 3, participants solving the "medium" Sudoku puzzles showed a reduction in stress levels compared to those solving the "hard" Sudoku puzzles.  Table 2 displays the accuracy and F1-score results for different models and combinations of biosignals used in the classification. The results revealed that the LRCN model demonstrated the highest effectiveness in predicting stress levels when utilizing ECG data, achieving an average accuracy of 93.42% and an F1-score of 88.11%. Other models also exhibited high levels of accuracy and F1-score, with StressNeXt and self-supervised CNN achieving the best performance when combining ECG and EEG signals. Notably, the utilization of EEG alone resulted in the lowest accuracy and F1-score across all models.  Table 2 displays the accuracy and F1-score results for different models and combinations of biosignals used in the classification. 
The results revealed that the LRCN model was the most effective at predicting stress levels when utilizing ECG data, achieving an average accuracy of 93.42% and an F1-score of 88.11%. The other models also exhibited high accuracy and F1-scores, with StressNeXt and the self-supervised CNN achieving their best performance when combining ECG and EEG signals. Notably, using EEG alone resulted in the lowest accuracy and F1-score across all models. This observation could potentially be attributed to participants' suboptimal reduction of movements during the experiments, as proposed by [24,25]; the presence of muscle noise might have compromised the effectiveness of the collected EEG data in this study.

Table 3 displays the accuracy and F1-score outcomes for the various model and biosignal combinations used in stress level classification under the different experimental scenarios. The findings suggest that the StressNeXt model was most successful at predicting stress levels in scenario 3, while the LRCN model was the most effective in scenarios 1 and 2. ECG was the most effective signal for analyzing stress levels, with all models achieving around 95% accuracy and 90% F1-score when utilizing ECG data. Conversely, EEG was the least effective signal for detecting stress levels, with all models yielding the lowest accuracies and F1-scores: below 70% in scenarios 1 and 2, and below 90% in scenario 3. In addition, all models predicted stress levels rather effectively in scenario 3, with accuracy and F1-score reaching approximately 90% for all signal combinations.

Table 4 displays a comparison across Sudoku difficulty levels for each model. The accuracy rates and F1-scores of the StressNeXt model were slightly higher in the "medium" level Sudoku sample group than in the "hard" level group.
Conversely, the other two models exhibited slightly lower accuracy rates and F1-scores in the "medium" level Sudoku sample group than in the "hard" level group; however, these differences were not statistically significant.
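To illustrate how the seven signal combinations evaluated above can be assembled into model inputs, the following is a minimal sketch; the windowed-data layout and all names are assumptions for illustration, not the authors' code:

```python
from itertools import combinations

SIGNALS = ["PPG", "ECG", "EEG"]

def signal_combinations(signals=SIGNALS):
    # Enumerate all non-empty subsets: PPG, ECG, EEG, PPG+ECG, PPG+EEG,
    # ECG+EEG, and PPG+ECG+EEG (7 combinations for 3 signals).
    combos = []
    for r in range(1, len(signals) + 1):
        combos.extend(combinations(signals, r))
    return combos

def stack_channels(windows, combo):
    # windows: dict mapping signal name -> list of per-window sample lists.
    # Returns, for each window, the selected signals stacked as channels,
    # ready to feed a multi-channel 1D model.
    n = len(windows[combo[0]])
    return [[windows[s][i] for s in combo] for i in range(n)]
```

Each combination can then be trained and scored separately, which is how a table like Table 2 (one accuracy/F1 row per signal subset) would be produced.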

Classifiers Evaluation
The confusion matrices for StressNeXt, LRCN, and the self-supervised CNN with different input signals, across all scenarios and Sudoku difficulty levels, are displayed in Figures 11, 12, and 13, respectively. After data segmentation, the dataset contained 8100 samples in total: 6030 samples for stress level class 0, 1710 for class 1, and 360 for class 2. Each confusion matrix was obtained by averaging the validation-set confusion matrices across the folds of the 3-fold cross-validation. Because class 0 dominates the training data, the models tend to lean toward predicting class 0. StressNeXt was unable to predict class 2 stress levels from ECG, whereas the other models could classify most class 2 samples using ECG alone. The self-supervised CNN failed to distinguish the other classes from class 0 when given EEG or all signals combined as input, and could not classify any class 2 samples when given combined PPG and ECG as input.

Figure 10. (a) Relationship between stress levels and scenarios, and (b) distribution of stress levels for each participant in the hard-level Sudoku, where the legend represents the scenarios.
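The fold-averaging step described above can be sketched in a few lines of pure Python; the function names are illustrative, not the authors' implementation:

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    # m[i][j] counts samples with true class i predicted as class j.
    m = [[0] * n_classes for _ in range(n_classes)]
    for y, p in zip(y_true, y_pred):
        m[y][p] += 1
    return m

def average_matrices(mats):
    # Element-wise mean of the per-fold validation confusion matrices,
    # e.g. the 3 folds of a 3-fold cross-validation.
    k = len(mats)
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) / k for j in range(n)]
            for i in range(n)]
```

With a class distribution as skewed as 6030/1710/360, the diagonal of the class-0 row dominates such an averaged matrix, which is consistent with the models' observed bias toward predicting class 0.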
Figure 11. StressNeXt confusion matrix for (a) PPG as input, (b) ECG as input, (c) EEG as input, (d) combined PPG and ECG as input, (e) combined PPG and EEG as input, (f) combined ECG and EEG as input, (g) combined PPG, ECG, and EEG as input.

Figure 13. Self-supervised CNN confusion matrix for (a) PPG as input, (b) ECG as input, (c) EEG as input, (d) combined PPG and ECG as input, (e) combined PPG and EEG as input, (f) combined ECG and EEG as input, (g) combined PPG, ECG, and EEG as input.

Table 5 illustrates a performance comparison between our proposed model and previous studies on stress detection. The results indicate that our model outperforms the others in accuracy in scenario 3, where subjects solve Sudoku under comforting conditions. Similarly, in scenario 2, where subjects solve Sudoku while another individual monitors the process, our model achieves the highest F1-score among the compared models. Overall, our model demonstrates superior accuracy compared to Transformer [26], Random Forest [27], AdaBoost DT [17], DeepER Net [28], Artificial Neural Network [29], Deep ECGNet [30], and CNN-LSTM [31]. The comparison highlights that the effectiveness of these models may be specific to particular datasets, depending on the scenario design and experiment settings. Although the study conducted by [12], which utilized a Deep 1D-CNN, exhibited superior performance compared to our proposed work, it required more than five sensors as inputs, which may not be feasible for real-world applications.
Furthermore, that study did not evaluate the effect of individual sensor data on model performance, and its sample size included only 15 participants, which may not adequately represent the range of human stress responses. Similarly, the study using SVM-RBF [25] reported slightly higher accuracy than our proposed work. However, it is worth noting that our proposed work achieved higher accuracy (93.42%) when utilizing only ECG signals, compared to [13], which utilized PPG (80%) and GSR + PPG (86.25%) signals. A direct comparison is difficult because our current study did not incorporate the GSR signal, as its response varies with environmental conditions. In general, most studies included fewer than 20 participants, and variations in task design and labeling methods likely contribute to the performance differences. Nonetheless, our study attains the highest accuracy (98.78%) in scenario 3 and the highest F1-score (96.67%) in scenario 2, surpassing the existing work.

Conclusions
In this study, a novel experimental approach was used to evaluate participants' stress levels through Sudoku puzzles as problem-solving tasks. The study utilized PPG, ECG, and EEG data to extract features and assess stress levels, and various scenarios were created using audio and video components to examine the impact of environmental stimuli on stress. The results indicated that noisy environments tended to induce higher stress levels, and that higher-difficulty Sudoku puzzles may lead to a higher average stress level. The stress detection models used in the experiments showed high effectiveness, with the LRCN model achieving a stress level detection accuracy of 93.42% and an F1-score of 88.11% when ECG was used as the input signal. When analyzing the individual scenarios, the StressNeXt model demonstrated exceptional effectiveness in predicting stress levels under comforting conditions, achieving an accuracy of 98.78% and an F1-score of 95.39%, while the LRCN model was most effective in the other two scenarios, achieving accuracies of 95.13% and 97.76%, with F1-scores of 93.72% and 96.67%, respectively. However, the study has some limitations, such as the subjectivity of self-reporting, which can result in variations in reported stress levels. To improve reliability, more robust self-reporting instruments, such as the Self-Assessment Manikin (SAM), could be employed to validate the accuracy and reliability of the stress level assessment. In addition, future work will explore the potential inclusion of other physiological signals, such as GSR or respiratory rate, to examine their effectiveness in stress detection, and will incorporate another dataset containing diverse scenarios and environments to validate the performance of the proposed deep learning models.