1. Introduction
Scientific and technological advancements have enabled more effective solutions in fields such as human–machine interaction, pattern recognition, and biological signal processing [1]. In particular, the intersection of machine learning with behavioral analysis and health informatics has made it increasingly important to detect individuals’ psychophysiological states automatically. In this context, the classification of stress through physiological signals has become a multidisciplinary research area [2]. Stress is a common physiological response indicating an individual’s exposure to excessive demands or pressure. While short-term stress can be functional for adaptation, chronic stress is known to have long-lasting and detrimental effects on the organism. It can affect various biological systems, including the musculoskeletal, cardiovascular, and respiratory systems.
In healthcare environments, stress becomes even more critical. Nursing is inherently a high-stress profession, with factors such as heavy workloads, long shifts, high patient turnover, and emotional demands all contributing to the increased risk of burnout among nurses. The COVID-19 pandemic further highlighted the challenges nurses face in managing stress, leading to widespread burnout [3,4]. Stress not only affects the mental and physical well-being of nurses but also has a direct impact on patient safety, care quality, and the sustainability of healthcare services.
Therefore, the accurate and continuous monitoring of stress is vital not only for improving individual health and worker well-being but also for maintaining patient safety and sustainable healthcare delivery. Traditional self-report-based stress assessment methods are susceptible to subjective bias, resulting in a growing need for objective, data-driven approaches that utilize physiological signals. Wearable sensors provide a more reliable reflection of an individual’s real-time physiological condition, enabling analysis through machine learning algorithms. Parameters such as electrodermal activity (EDA), heart rate (HR), and skin temperature (TEMP) carry valuable bio-physiological indicators of stress. Furthermore, galvanic skin response, heart rate variability, and peripheral blood flow offer direct insights into stress levels via the autonomic nervous system [5,6].
The key physiological signals commonly used in stress detection are summarized below:
Galvanic skin response (GSR or EDA): Measures skin conductance changes due to perspiration. Strongly linked to sympathetic nervous system activation during stress.
Electrocardiogram (ECG) and heart rate variability (HRV): HRV indicates variability between heartbeats; lower HRV is associated with poor stress resilience.
Blood volume pulse (BVP): Measures changes in peripheral blood flow, reflecting autonomic nervous system activity associated with stress.
Skin temperature (TEMP): Can vary due to changes in thermoregulation under stress.
Accelerometer data (X, Y, Z): Indicates physical movement or inactivity, which may serve as indirect indicators of stress [6].
This study aims to classify stress levels using biosignal data collected via the Empatica E4 wearable device. The dataset was collected from 15 female nurses working in a hospital setting in two phases: Phase I (15 April–6 August 2020) and Phase II (8 October–11 December 2020). Comprising over 11.5 million time-series records, the dataset includes electrodermal activity (EDA), heart rate (HR), skin temperature (TEMP), and tri-axial accelerometer signals (X, Y, Z). Each record is labeled with one of three classes: 0 (low stress), 1 (moderate stress), and 2 (high stress). Notably, the dataset suffers from significant class imbalance, with the high-stress class (label = 2.0) constituting approximately 74% of all records. To mitigate this issue, the Synthetic Minority Over-sampling Technique (SMOTE) was applied for data balancing.
The primary objective of this study is to classify nurse stress levels using physiological sensor data and to compare the performance of several machine learning algorithms, including Random Forest, XGBoost, k-Nearest Neighbor (k-NN), and LightGBM. Additionally, temporal analysis was conducted to examine stress patterns across time (by day and hour), evaluating the applicability of real-time stress monitoring systems in healthcare settings.
2. Related Work
Kabiruzzaman et al. [7] conducted a time-series analysis of real-world data provided as part of the “Fourth Nurse Activity Recognition Challenge.” Using time-based features extracted from care records, four classification algorithms were implemented, with the Decision Tree model achieving the highest accuracy (66%). Nevertheless, the Random Forest classifier demonstrated a superior performance for three out of five users, indicating its potential for nurse activity recognition through temporal data.
Eldien et al. [5] explored various fusion strategies to enhance the accuracy of stress prediction in nurses using multimodal sensor data, including heart rate (HR), electrodermal activity (EDA), skin temperature, and accelerometer-derived location information. The study compared data-level, model-level, and prediction-level fusion strategies, with prediction-level fusion achieving the highest accuracy, outperforming model fusion by 1.97% and data fusion by 1.26%. This research underscores the potential of fusion techniques in capturing the multifaceted nature of stress, contributing to the development of robust stress prediction systems in healthcare settings.
Pasha et al. [8] proposed a multi-class classification approach for detecting nurses’ stress levels (no stress, low stress, high stress) using physiological data collected via the Empatica E4 device (Empatica Inc., Milan, Italy). The study utilized preprocessed signals—heart rate (HR), electrodermal activity (EDA), and skin temperature (ST)—and developed both a Bidirectional LSTM with Attention Mechanism (BiLSTM-AM) and a stacking-based ensemble model combining DT, RF, XGBoost, MLP, and LR. The BiLSTM-AM model, implemented in Python with TensorFlow and Keras on Google Colaboratory, achieved 96% accuracy, while the ensemble model reached 97%. However, the exact software and library versions were not reported, which may limit reproducibility. Model performance was evaluated using precision, recall, F1-score, ROC curves, and ablation studies, highlighting the practical potential of AI-supported real-time stress monitoring through wearable devices.
Quadrini, Falcone, and Gerard [9] conducted a comparative analysis of machine learning algorithms for stress detection using physiological signals from wearable sensors. Using the WESAD dataset, which includes BVP, ECG, and EMG signals, the study avoided feature engineering by applying models directly to raw signal fragments. Ten algorithms from tree-based, ensemble, linear, and neighbor-based model families were tested under both binary (stress/no-stress) and multi-class (baseline, stress, and amusement) classification scenarios. Random Forest consistently outperformed other models across both classification tasks, demonstrating the feasibility of ML-based approaches in stress recognition.
Alrosan et al. [10] examined nurse stress detection using machine learning, emphasizing the importance of personalized and adaptive models due to individual variability in stress responses. Physiological, behavioral, and environmental data collected via wearable devices enabled the real-time monitoring of stress. Meta-heuristic algorithms were employed for hyperparameter optimization, leading to a Random Forest model that achieved 96.63% accuracy and a 95.76% F1-score. The findings validate the effectiveness of optimized ML models in stress classification.
Jain et al. [11] evaluated the application of machine learning models to mental health assessment using an extensive dataset that incorporated demographic, lifestyle, and behavioral attributes. Among the compared models, boosting yielded the highest accuracy (81.75%), outperforming decision tree (80.69%) and logistic regression (79.63%). Feature engineering techniques improved interpretability and model performance, while cross-validation ensured robustness. This work highlights the relevance of AI models in identifying mental health risks and emphasizes the link between lifestyle factors and stress.
Kang, Kwon, and Lee [12] investigated the impact of patient safety incidents on nurses’ work–life balance (WLB) using classification and regression tree (CART) analysis. Drawing from a sample of 372 nurses, the study incorporated variables such as education, marital status, position, physical distress, second-victim support, turnover intention, and absenteeism. Key findings revealed that lower physical distress, fewer turnover intentions, and limited second-victim support correlated with higher WLB scores. The study provides actionable insights for mitigating occupational stress and fostering a supportive organizational culture in healthcare environments.
Razavi et al. [13] conducted a comprehensive scoping review on machine learning and deep learning applications for detecting and monitoring stress and related mental disorders. Evaluating 98 studies, they identified support vector machines, neural networks, and Random Forest as top-performing algorithms. Physiological signals such as heart rate and skin conductance emerged as dominant predictors. The review emphasized the importance of preprocessing methods, including dimensionality reduction and noise filtering. It highlighted future research directions related to model interpretability, personalization, and real-time deployment in naturalistic settings.
3. Method
As illustrated in Figure 1, the proposed machine learning-based stress prediction workflow consists of several sequential stages: data preprocessing, resampling into one-minute intervals, Synthetic Minority Over-sampling Technique (SMOTE) application for class balancing, feature selection (including electrodermal activity, heart rate, skin temperature, and accelerometer signals), model training using four supervised algorithms (Random Forest, XGBoost, k-Nearest Neighbors, and LightGBM), hyperparameter optimization, and final model evaluation based on accuracy, F1-score, and confusion matrices. This block diagram provides a clear and structured overview of the methodological pipeline implemented in this study.
Analyses were performed in Google Colaboratory (Python 3.12.11; OS: Linux 6.1.123+ x86_64). The key library versions were as follows: scikit-learn 1.6.1 (for Random Forest and k-Nearest Neighbors), XGBoost 3.0.4, LightGBM 4.6.0, and imbalanced-learn 0.14.0 (for SMOTE). Additional libraries included NumPy 2.0.2 and pandas 2.2.2. Providing these version details ensures reproducibility and aligns with best practices in reporting computational experiments.
3.1. Dataset
In this study, the “Nurse Stress Prediction using Wearable Sensors” dataset, publicly available on the Kaggle platform and provided by Priyank Raval, was utilized to classify occupational stress levels among nurses [14]. The dataset comprises time-series physiological data collected via Empatica E4 wearable devices during the routine clinical activities of 15 female nurses in a hospital setting, across two distinct periods: Phase I (15 April–6 August 2020) and Phase II (8 October–11 December 2020).
The dataset includes measurements such as triaxial accelerometer data (X, Y, Z), electrodermal activity (EDA), heart rate (HR), skin temperature (TEMP), nurse identification code (id), timestamp (datetime), and stress level labels (label). Stress levels are categorized into three classes: 0 (low), 1 (moderate), and 2 (high). Summary statistics of the dataset are presented in Table 1.
As shown in Table 1, the accelerometer values exhibit varying distributions across the X, Y, and Z axes. Electrodermal activity (EDA) values range from 0.009 to 54.68, with an average of 3.48. The heart rate (HR) statistics are internally inconsistent: the reported minimum (524.53) exceeds the reported maximum (170.14), pointing to a data anomaly in these summary values.
3.2. Data Preprocessing
As part of the preprocessing steps, the raw time-series data were aggregated into one-minute intervals for each nurse, with average values computed for each window. This procedure was employed to reduce noise and generate a more analyzable dataset structure. The dominant stress level for each window was assigned using the mode calculation. As a result of this resampling process, the data volume was significantly reduced, and class distributions were altered. Before preprocessing, the high-stress class (label = 2.0) was dominant, accounting for 8,540,583 samples, whereas the low-stress (0.0) and medium-stress (1.0) classes contained 2,162,246 and 806,222 samples, respectively. After preprocessing, the number of samples decreased to 4572 for the high-stress class, 1165 for the low-stress class, and 419 for the medium-stress class (Figure 2). While this reduction was anticipated and yielded a more manageable dataset, the class imbalance remained, with the high-stress class still prevailing. This highlighted the necessity of addressing class imbalance in subsequent modeling stages.
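The resampling step described above can be sketched in pandas as follows. This is an illustrative example on synthetic one-nurse data, not the study’s code; the column names ("EDA", "HR", "TEMP", "label") and the 1 Hz sampling rate are assumptions made for the sketch.

```python
# Sketch: aggregate raw time-series into one-minute windows, taking the
# mean of each physiological feature and the modal stress label per window.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600  # ten minutes of toy 1 Hz data for one nurse
df = pd.DataFrame({
    "datetime": pd.date_range("2020-04-15 08:00", periods=n, freq="s"),
    "EDA": rng.normal(3.5, 1.0, n),
    "HR": rng.normal(80, 5, n),
    "TEMP": rng.normal(33, 0.5, n),
    "label": rng.choice([0.0, 1.0, 2.0], size=n, p=[0.2, 0.1, 0.7]),
}).set_index("datetime")

# Mean of each feature per one-minute window...
features = df[["EDA", "HR", "TEMP"]].resample("1min").mean()
# ...and the dominant (most frequent) stress label for that window.
labels = df["label"].resample("1min").agg(lambda s: s.mode().iloc[0])
resampled = features.join(labels)
print(resampled.shape)  # (10, 4): one row per minute
```

In the study this aggregation would additionally be applied per nurse (e.g., by grouping on the id column) before the windows are pooled.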
Following this, six physiological features (X, Y, Z, EDA, HR, and TEMP) were selected as independent variables for model development, and missing data were assessed. Missing values were imputed using the mean of each column to avoid data loss. Given the different measurement scales across features, z-score normalization was applied, and scaling was performed using the StandardScaler from scikit-learn (version 1.6.1). Since class imbalance persisted after resampling, the Synthetic Minority Over-sampling Technique (SMOTE) was employed to balance the class distribution. Only valid (non-NaN) samples were included in this process. The application of SMOTE enhanced the model’s capacity to learn from minority classes and minimized inter-class bias, thereby improving the overall classification performance.
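The imputation, z-score scaling, and oversampling steps can be illustrated as below. The study used imbalanced-learn’s SMOTE (version 0.14.0); to keep this sketch self-contained, a minimal SMOTE-style interpolation (`smote_like`, a hypothetical helper) stands in for it, and the data are synthetic.

```python
# Sketch of preprocessing: column-mean imputation, z-score normalization,
# and SMOTE-style minority oversampling (illustrative, not the study's code).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def smote_like(X, y, k=5, random_state=0):
    """Oversample minority classes by interpolating each picked sample
    toward one of its k nearest same-class neighbours (SMOTE's core idea)."""
    rng = np.random.default_rng(random_state)
    counts = np.bincount(y)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls in np.where(counts < target)[0]:
        Xc = X[y == cls]
        for _ in range(target - counts[cls]):
            i = rng.integers(len(Xc))
            d = np.linalg.norm(Xc - Xc[i], axis=1)
            j = rng.choice(np.argsort(d)[1:k + 1])  # a near neighbour, not i
            X_parts.append((Xc[i] + rng.random() * (Xc[j] - Xc[i]))[None, :])
            y_parts.append(np.array([cls]))
    return np.vstack(X_parts), np.concatenate(y_parts)


rng = np.random.default_rng(42)
X = rng.normal(size=(600, 6))            # columns: X, Y, Z, EDA, HR, TEMP
X[rng.random(X.shape) < 0.01] = np.nan   # sprinkle some missing values
y = rng.choice([0, 1, 2], size=600, p=[0.2, 0.1, 0.7])  # imbalanced labels

X = SimpleImputer(strategy="mean").fit_transform(X)  # column-mean imputation
X = StandardScaler().fit_transform(X)                # z-score normalization
X_res, y_res = smote_like(X, y)
print(np.bincount(y_res))  # all three classes now equally sized
```

With imbalanced-learn installed, the last call would simply be `SMOTE(random_state=42).fit_resample(X, y)`.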
3.3. Modeling and Classification Methods
In this study, four supervised learning algorithms were implemented to classify nurses’ stress levels using physiological data: Random Forest (RF), XGBoost, k-Nearest Neighbors (k-NN), and LightGBM.
To enhance model performance, hyperparameter optimization was conducted for the Random Forest and XGBoost classifiers. The RandomizedSearchCV technique was employed to explore predefined parameter ranges and determine optimal configurations. For Random Forest, parameters such as n_estimators, max_depth, min_samples_split, min_samples_leaf, and bootstrap were tuned. For XGBoost, the optimized parameters included n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma, and reg_lambda. The models were retrained using these optimized settings and prepared for evaluation.
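The hyperparameter search described above can be sketched with scikit-learn’s RandomizedSearchCV. The parameter ranges below are illustrative assumptions over the tuned parameters named in the text, not the study’s exact search space, and the data are synthetic.

```python
# Sketch: randomized hyperparameter search for a Random Forest classifier
# over the parameters named in the text (ranges are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=10, cv=3, scoring="f1_weighted", random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)  # best sampled configuration
```

The same pattern applies to XGBoost with its own distribution over n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma, and reg_lambda.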
During the model development phase, the class-balanced dataset was split into training (80%) and testing (20%) subsets to ensure reliable performance assessment.
3.4. Evaluation Metrics
To objectively assess the performance of the developed classification models, this study employed evaluation metrics commonly used in multi-class classification problems. Accordingly, the models were evaluated based on accuracy, F1-score, confusion matrix, and classification report. These metrics not only provide an overall measure of model performance but also allow for the assessment of discriminative power, particularly for minority classes in imbalanced datasets [15].
Accuracy: Represents the ratio of correctly predicted instances to the total number of cases. While it offers a general summary of model performance, it may be misleading in datasets with class imbalance and is therefore insufficient when used in isolation.
F1-score: The harmonic mean of precision and recall. It is particularly important for evaluating the model’s ability to correctly classify minority classes. In this study, the weighted F1-score was employed, with each class’s contribution weighted proportionally to its sample size.
Confusion matrix: A matrix that shows the model’s predictions versus the actual classes, providing detailed insights into which class pairs are most frequently misclassified.
Classification report: Summarizes the precision, recall, and F1-score for each class individually. This report was particularly useful for evaluating model performance across the three stress levels (0: low, 1: medium, 2: high).
The combined use of these metrics enabled a comprehensive evaluation of the models, not only in terms of overall accuracy but also in their ability to accurately distinguish between varying levels of stress. Detailed results for each metric per model are presented in the Results Section.
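These four metrics can be computed with scikit-learn as sketched below, on a small set of toy predictions (the labels are invented for illustration only).

```python
# Sketch: the four evaluation outputs used in this study, on toy predictions.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 1, 2, 2, 2, 0]

print(accuracy_score(y_true, y_pred))                  # 0.75 (6 of 8 correct)
print(f1_score(y_true, y_pred, average="weighted"))    # class-size-weighted F1
print(confusion_matrix(y_true, y_pred))                # rows: true, cols: predicted
print(classification_report(y_true, y_pred,
                            target_names=["low", "medium", "high"]))
```

The `average="weighted"` option matches the weighted F1-score described above, where each class contributes proportionally to its sample size.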
4. Results
Figure 3 illustrates the average stress levels of nurses across the days of the week. The graph reveals notable variations in stress distribution throughout the week. The highest average stress level was observed on Sunday, followed by Wednesday and Monday, suggesting that stress escalates both around the weekend transition and in the middle of the week. In particular, Wednesday’s average stress level exceeding 1.8 may indicate an increased workload during weekdays.
In contrast, Tuesday shows the lowest average stress level, while Thursday and Friday exhibit relatively balanced stress profiles. A renewed increase in stress levels toward the weekend may be attributed to factors such as shift transitions, weekend workload surges, or accumulated responsibilities from the previous week. These findings highlight that stress is influenced not only by individual factors but also by temporal organizational dynamics, offering valuable insights for optimizing weekly scheduling in healthcare settings.
Figure 4 presents the average stress levels of nurses across different hours of the day, revealing significant temporal fluctuations. Notably, between 00:00 and 11:00, the average stress level remains close to 2.0, indicating that nurses experience heightened stress during the early hours. This elevated stress may be linked to factors such as fatigue following night shifts, morning patient handovers, or early reporting responsibilities.
From noon onward (post 12:00), a noticeable decrease in stress levels is observed. During the 13:00–17:00 interval, stress levels fluctuate, but a more consistent decline becomes evident after 18:00. After 20:00, stress levels fall below 1.0, reaching their lowest point of the day around 23:00—likely reflecting reduced workloads or shift transitions.
These findings indicate that nurses’ occupational stress varies not only from week to week but also on an hourly basis. This underlines the need for time-sensitive strategies in institutional practices such as stress management and shift scheduling.
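The temporal aggregation behind Figures 3 and 4 amounts to grouping the labeled records by weekday and by hour and averaging the stress label. A minimal sketch on synthetic hourly labels (real values differ):

```python
# Sketch: average stress level by day of week and by hour of day,
# as in the temporal analyses (labels here are random placeholders).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2020-04-15", periods=7 * 24, freq="h")  # one toy week
df = pd.DataFrame({"label": rng.choice([0, 1, 2], size=len(idx))}, index=idx)

by_day = df.groupby(df.index.day_name())["label"].mean()   # cf. Figure 3
by_hour = df.groupby(df.index.hour)["label"].mean()        # cf. Figure 4
print(by_day.round(2))
print(by_hour.round(2))
```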
As shown in Table 2, Random Forest outperformed the other classifiers, achieving the highest accuracy (0.91) with balanced precision, recall, and F1-scores across all classes. XGBoost followed with an accuracy of 0.88 and stable performance across labels. In contrast, k-NN and LightGBM reached lower accuracies (0.85), particularly struggling to identify high stress (class 2), where their F1-scores dropped to 0.73 and 0.80, respectively. These results suggest that, for this stress detection task, the bagged Random Forest ensemble and XGBoost were more effective than the distance-based k-NN and the LightGBM gradient-boosting implementation.
Figure 5 illustrates the confusion matrices for the four classification algorithms, providing detailed insights into how accurately each model predicted the three stress level classes (0: low, 1: medium, 2: high) and where misclassifications occurred.
As shown in Figure 5a, the k-NN model exhibited notably low sensitivity for the high-stress class (label 2), frequently misclassifying it as low stress (label 0), indicating a limited ability to distinguish elevated stress levels. In Figure 5b, the LightGBM model demonstrated a more balanced performance but struggled particularly with the low-stress class, misclassifying 99 instances as high stress (label 2). Figure 5c presents the Random Forest model, which achieved a high accuracy in distinguishing both medium and high stress levels. It demonstrated a strong performance in classifying class 1 (medium stress) correctly, while minimizing confusion between classes 0 and 2, indicating a robust discriminative capability across all stress levels. Similarly, the XGBoost model in Figure 5d displayed a strong performance, particularly for the high-stress class, though it also misclassified 99 low-stress instances as high stress.
Overall, the Random Forest and XGBoost models outperformed the others in both accuracy and class discrimination, showing greater reliability, especially in predicting critical stress levels.
Table 3 presents a comparative analysis of the performance metrics for the Random Forest and XGBoost models following hyperparameter optimization. Both models demonstrated significant improvements compared with their baseline configurations. The optimized Random Forest model achieved a balanced and high classification performance across all stress levels. Specifically, it reached F1-scores of 93%, 96%, and 90% for the low-stress (label 0), medium-stress (label 1), and high-stress (label 2) classes, respectively. The overall accuracy of the model was 93%, with macro and weighted average scores also attaining 0.93, indicating consistent and reliable predictions across both frequent and less-frequent classes.
The XGBoost model, after hyperparameter tuning, achieved an accuracy of 90% and demonstrated a strong class separation ability, particularly in the medium-stress class with an F1-score of 94%. However, the model yielded comparatively lower recall and F1-scores for the high-stress class (label 2) than Random Forest, suggesting that while XGBoost excels in distinguishing certain classes, it may be less effective in identifying individuals experiencing higher stress levels.
Overall, the Random Forest model emerged as the best-performing algorithm after hyperparameter optimization, offering superior general accuracy and class-specific performance. Consequently, it was deemed more suitable for real-time stress prediction applications in healthcare settings.
Figure 6 visualizes the contribution levels of input features to stress classification, as determined by the hyperparameter-optimized Random Forest (Figure 6a) and XGBoost (Figure 6b) models. Both models identified electrodermal activity (EDA) and skin temperature (TEMP) as the most informative features. This finding supports the reliability of monitoring stress through physiological responses, particularly those directly linked to the autonomic nervous system, such as EDA and body temperature.
In the Random Forest model, EDA and TEMP were followed in importance by Z-axis acceleration and X-axis acceleration. Similarly, the XGBoost model ranked TEMP and EDA as the top features, followed by Z, X, and heart rate (HR). In both models, Y-axis acceleration and HR were found to have relatively lower feature importance scores.
These results indicate that direct physiological signals, such as EDA and TEMP, serve as the most effective predictors of stress levels. At the same time, accelerometer data (X, Y, Z) play a complementary yet secondary role in model decision-making. Accordingly, the use of EDA and temperature signals is deemed critical for enhancing model performance in sensor-based stress prediction systems.
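The feature-importance rankings discussed above come directly from the fitted tree ensembles. A hedged sketch on synthetic data (the real ranking, per Figure 6, placed EDA and TEMP first):

```python
# Sketch: extracting impurity-based feature importances from a fitted
# Random Forest, as visualized in Figure 6a (synthetic data, illustrative).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
names = ["X", "Y", "Z", "EDA", "HR", "TEMP"]  # feature order assumed

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=names)
print(importances.sort_values(ascending=False))  # importances sum to 1
```

The same attribute exists on a fitted XGBoost classifier (`feature_importances_`), which is how the Figure 6b ranking would be obtained.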
5. Conclusions and Recommendations
This study aimed to classify the stress levels experienced by nurses in clinical settings using physiological time-series data obtained from wearable sensors. The dataset, comprising electrodermal activity (EDA), heart rate (HR), skin temperature (TEMP), and tri-axial accelerometer data (X, Y, Z), was resampled into one-minute intervals. Due to the significant class imbalance observed, the dataset was balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Four supervised learning algorithms—Random Forest, XGBoost, k-NN, and LightGBM—were applied and compared in terms of accuracy, F1-score, and classification performance.
The findings indicated that Random Forest and XGBoost achieved superior classification performance. After hyperparameter optimization, Random Forest emerged as the most successful model with 93% accuracy and high F1-scores across all stress levels, while XGBoost followed closely with 90% accuracy. In contrast, k-NN and LightGBM yielded a relatively lower performance, particularly showing insufficient sensitivity in classifying high-stress instances.
Temporal analyses revealed significant variations in stress levels across both weekly and daily periods. Stress levels were highest on Sundays and Wednesdays, as well as during early morning hours, whereas lower stress levels were observed on Tuesdays and in the evening. These patterns highlight the combined influence of individual circadian rhythms and institutional workload dynamics on stress. Feature importance analyses further demonstrated that EDA and TEMP were the most informative predictors in both Random Forest and XGBoost models, supporting the literature that identifies skin conductance and temperature as reliable physiological biomarkers of stress.
Overall, the results underscore the feasibility of implementing AI-assisted, sensor-based monitoring systems in healthcare institutions to track nurses’ stress levels. Real-time monitoring solutions centered on EDA and TEMP hold significant potential for early stress detection and intervention. Moreover, the observed temporal stress patterns suggest the need for shift scheduling to be adapted accordingly, incorporating preventive measures such as extended breaks, task rotation, and psychosocial support during periods of peak stress. Future studies should expand beyond physiological inputs to develop comprehensive stress prediction models by integrating contextual (e.g., task type, patient load) and environmental variables, thereby enhancing both staff well-being and healthcare service quality.
6. Discussion
This study evaluated the applicability of machine learning-based classification methods using physiological sensor data to objectively monitor nurses’ stress levels in clinical environments. The results demonstrated that the biomarkers associated with the autonomic nervous system—particularly electrodermal activity (EDA), skin temperature (TEMP), and heart rate (HR)—possess strong discriminative power for stress classification. These findings align with prior research that highlights the importance of such physiological signals in stress detection, confirming that wearable technologies offer reliable and cost-effective solutions for continuous stress monitoring in healthcare settings [4,6,11,12].
Temporal pattern analyses revealed that stress levels are significantly influenced by both individual circadian rhythms and institutional workload dynamics. Specifically, elevated stress levels were observed on Sundays, Wednesdays, and during early morning hours, indicating that shift scheduling and workload distribution are critical intervention points for mitigating stress. Similar to the findings reported by Kang et al. (2024) [12], stress emerged not only as an individual psychological response but also as a function of broader organizational processes.
Among the four supervised learning algorithms tested, the Random Forest and XGBoost models, after undergoing hyperparameter optimization, demonstrated the highest classification accuracy and F1-scores. Random Forest exhibited a balanced and robust performance across all stress levels, while XGBoost achieved particularly high performance in classifying moderate stress levels. In contrast, k-Nearest Neighbors (k-NN) and LightGBM models struggled with lower sensitivity, particularly for high-stress instances, indicating that class imbalance and inherent dataset characteristics significantly affect model performance. This is consistent with Alrosan et al. (2024) [10], who emphasized the role of meta-heuristic optimization in enhancing classification robustness under similar conditions.
Feature importance analyses further reinforced the central role of EDA and TEMP as the most critical predictors in both the Random Forest and XGBoost models. This supports the physiological basis of stress and highlights the potential for integrating such biomarkers into real-time early warning systems in clinical settings. Conversely, the relatively lower importance of accelerometer variables (X, Y, Z) suggests a more indirect relationship between physical movement and acute stress responses. This differentiation is also reflected in the work of Jain et al. [11], where behavioral and lifestyle features contributed to mental health predictions but were not as decisive as physiological indicators.
Overall, the findings provide strong empirical support for integrating wearable sensor-based, machine-learning-driven stress monitoring systems into nurses’ working environments. These systems hold considerable promise for enhancing the quality of clinical care by enabling timely and data-driven stress interventions. Future research should move beyond solely physiological data to include contextual variables (e.g., job role, shift type, clinical experience), and explore the implementation of advanced deep learning architectures such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs), which are well-suited for modeling temporal dependencies in physiological time-series data. Such enhancements could significantly improve the accuracy and predictive power of stress classification models, contributing meaningfully to occupational health management in healthcare institutions.