Background/Objectives: This article addresses the challenge of stress detection across diverse contexts. Mental stress is a worldwide concern that substantially affects human health and productivity, rendering it a critical research challenge. Although numerous studies have investigated stress detection through machine learning (ML) techniques, there has been limited research on assessing ML models trained in one context and utilized in another. The objective of ML-based stress detection systems is to create models that generalize across various contexts.
Methods: This study examines the generalizability of ML models using EEG recordings from two stress-inducing contexts: mental arithmetic evaluation (MAE) and virtual reality (VR) gaming. We present a data collection workflow and publicly release a portion of the dataset. Furthermore, we evaluate classical ML models and their generalizability, offering insights into the influence of training data on model performance, data efficiency, and associated costs. EEG data were acquired with MUSE-S™ hardware during stressful MAE and VR gaming scenarios. The methodology entailed preprocessing the EEG signals via wavelet denoising with different mother wavelets, assessing individual and aggregated sensor data, and applying three ML models for classification: linear discriminant analysis (LDA), support vector machine (SVM), and K-nearest neighbors (KNN).
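The wavelet-denoising step can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a Haar mother wavelet with soft universal thresholding for simplicity, whereas the study compared several mother wavelets (e.g., Symlets-4 and Daubechies-2, as typically provided by a library such as PyWavelets); all function names are illustrative.

```python
# Minimal Haar wavelet denoising sketch (assumed, not the paper's exact pipeline).
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform."""
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of one Haar DWT level."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def wavelet_denoise(signal, level=4):
    """Soft-threshold the detail coefficients (universal threshold) and
    reconstruct. Signal length must be divisible by 2**level."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    # Noise estimate from finest-scale details (median absolute deviation).
    sigma = np.median(np.abs(details[0])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    details = [np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0) for d in details]
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx
```

Soft thresholding shrinks every detail coefficient toward zero by the threshold, which suppresses broadband noise while largely preserving the slower EEG dynamics carried by the approximation coefficients.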
Results: In Scenario 1, where MAE data were used for training and VR data for testing, the TP10 electrode attained an average accuracy of 91.42% across all classifiers and participants, whereas the SVM classifier achieved the highest average accuracy of 95.76% across all participants. In Scenario 2, with VR data for training and MAE data for testing, the maximum average accuracy of 88.05% was achieved with the combination of the TP10, AF8, and TP9 electrodes across all classifiers and participants, whereas the LDA model attained the peak average accuracy of 90.27% among all participants. The optimal performance was achieved with the Symlets-4 and Daubechies-2 mother wavelets for Scenarios 1 and 2, respectively.
Conclusions: The results demonstrate that although ML models exhibit generalization capabilities across stressors, their performance is significantly influenced by the alignment between training and testing contexts, as evidenced by systematic cross-context evaluations using an 80/20 train–test split per participant and quantitative metrics (accuracy, precision, recall, and F1-score) averaged across participants. The observed variations in performance across stress scenarios, classifiers, and EEG sensors provide empirical support for this claim.
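The cross-context evaluation protocol can be sketched with scikit-learn as follows. The feature matrices below are synthetic placeholders rather than real EEG features, and the classifier settings are assumptions, not the paper's exact configuration; the sketch only shows the train-in-one-context, test-in-the-other structure of Scenario 1.

```python
# Hedged sketch of Scenario 1: train on MAE-context data, test on VR-context data.
# X_mae / X_vr stand in for per-window EEG feature matrices (synthetic here).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X_mae = rng.normal(0.0, 1.0, (200, 8))   # training context (MAE)
y_mae = (X_mae[:, 0] > 0).astype(int)    # placeholder stress labels
X_vr = rng.normal(0.1, 1.0, (200, 8))    # testing context (VR), slight shift
y_vr = (X_vr[:, 0] > 0).astype(int)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, model in models.items():
    model.fit(X_mae, y_mae)              # train in one stress context
    y_pred = model.predict(X_vr)         # evaluate in the other context
    scores[name] = {
        "accuracy": accuracy_score(y_vr, y_pred),
        "precision": precision_score(y_vr, y_pred),
        "recall": recall_score(y_vr, y_pred),
        "f1": f1_score(y_vr, y_pred),
    }
print(scores)
```

In the actual study these metrics would be computed per participant and per electrode combination, then averaged across participants, which is how the reported figures such as the 91.42% TP10 accuracy arise.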