Cross-Context Stress Detection: Evaluating Machine Learning Models on Heterogeneous Stress Scenarios Using EEG Signals

Attallah, Omneya; Mamdouh, Mona; Al-Kabbany, Ahmad

doi:10.3390/ai6040079

by Omneya Attallah ¹,²,*, Mona Mamdouh ¹ and Ahmad Al-Kabbany ²,³,⁴,*

¹ Electronics and Communications Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology, and Maritime Transport, Alexandria 21937, Egypt

² Wearables, Biosensing, and Biosignal Processing Lab, Arab Academy for Science, Technology, and Maritime Transport, Alexandria 21937, Egypt

³ Multimedia Interaction and Communication Lab, Arab Academy for Science, Technology, and Maritime Transport, Alexandria 21937, Egypt

⁴ Intelligent Systems Lab, Arab Academy for Science, Technology, and Maritime Transport, Alexandria 21937, Egypt

* Authors to whom correspondence should be addressed.

AI 2025, 6(4), 79; https://doi.org/10.3390/ai6040079

Submission received: 3 March 2025 / Revised: 10 April 2025 / Accepted: 11 April 2025 / Published: 14 April 2025

(This article belongs to the Special Issue Artificial Intelligence in Biomedical Engineering: Challenges and Developments)

Abstract

Background/Objectives: This article addresses the challenge of stress detection across diverse contexts. Mental stress is a worldwide concern that substantially affects human health and productivity, rendering it a critical research challenge. Although numerous studies have investigated stress detection through machine learning (ML) techniques, there has been limited research on assessing ML models trained in one context and utilized in another. The objective of ML-based stress detection systems is to create models that generalize across various contexts. Methods: This study examines the generalizability of ML models employing EEG recordings from two stress-inducing contexts: mental arithmetic evaluation (MAE) and virtual reality (VR) gaming. We present a data collection workflow and publicly release a portion of the dataset. Furthermore, we evaluate classical ML models and their generalizability, offering insights into the influence of training data on model performance, data efficiency, and related expenses. EEG data were acquired using MUSE S™ hardware during stressful MAE and VR gaming scenarios. The methodology entailed preprocessing EEG signals using wavelet denoising with several mother wavelets, assessing individual and aggregated sensor data, and employing three ML models, namely linear discriminant analysis (LDA), support vector machine (SVM), and K-nearest neighbors (KNN), for classification. Results: In Scenario 1, where MAE data were used for training and VR data for testing, the TP10 electrode attained an average accuracy of 91.42% across all classifiers and participants, whereas the SVM classifier achieved the highest average accuracy of 95.76% across all participants. In Scenario 2, where VR data were used for training and MAE data for testing, the maximum average accuracy achieved was 88.05% with the combination of TP10, AF8, and TP9 electrodes across all classifiers and participants, whereas the LDA model attained the peak average accuracy of 90.27% among all participants. The optimal performance was achieved with Symlets 4 and Daubechies-2 for Scenarios 1 and 2, respectively. Conclusions: The results demonstrate that although ML models exhibit generalization capabilities across stressors, their performance is significantly influenced by the alignment between training and testing contexts, as evidenced by systematic cross-context evaluations using an 80/20 train–test split per participant and quantitative metrics (accuracy, precision, recall, and F1-score) averaged across participants. The observed variations in performance across stress scenarios, classifiers, and EEG sensors provide empirical support for this claim.

Keywords: stress detection; machine learning; wavelet denoising; electroencephalograms (EEGs); mental arithmetic evaluation (MAE); virtual reality

1. Introduction

According to the British Health and Safety Executive (HSE), stress, depression, or anxiety accounted for 46% of all work-related illnesses in the UK during 2023–2024, highlighting the critical and widespread impact of stress on public health and workforce productivity [1]. With the global increase in stress levels attributed to multiple factors, the demand for effective and efficient stress detection and management systems has become paramount [2]. In addition to its effects on personal health, stress adversely affects productivity and overall quality of life. Chronic stress is associated with burnout, diminished concentration, and decreased work performance, adversely affecting businesses and economies [3,4,5]. Implementing stress detection systems enables individuals and organizations to identify stressors promptly, adopt proactive strategies for stress management, and cultivate healthier work and living environments. These systems are essential for fostering mental well-being, increasing productivity, and enhancing overall life satisfaction.

The latest developments in wearable sensors and machine learning (ML) methods have enabled real-time stress detection through the continuous monitoring of physiological responses, including heart rate variability (HRV), skin conductance (SC), electrocardiogram (ECG), electromyogram (EMG), and electroencephalogram (EEG) signals [6,7]. Among these biomarkers, EEG is extensively acknowledged in the literature as one of the most efficacious indicators for stress detection and other neurological disorders [8,9,10]. Although these technologies enhance the accessibility and prevalence of stress monitoring, a major challenge persists in guaranteeing the reliable performance of stress detection systems across various contexts, referred to as cross-detection. Cross-detection denotes the capacity of an ML model developed in one context to generalize and effectively identify stress in diverse situations. For example, a model is trained on stress-inducing scenario X and then tested on scenario Y. In practical deployments, stress detection models must contend with the variability of unseen contexts. One example of this is the difference between workplace-induced cognitive stress and emotionally driven clinical stress, which makes it impractical to collect exhaustive stress data for every possible environment [11].

Numerous established methodologies are frequently employed to induce mental stress in research environments, including the Trier Social Stress Test (TSST), the Stroop Color Test, mental arithmetic evaluation (MAE), and socially evaluative tasks. Among these stressors, MAE is frequently used as a cognitive stressor, as solving complicated mathematical equations under time constraints and evaluating pressure triggers stress-related physiological responses [12]. Along the same lines, virtual reality (VR) has recently emerged as a powerful tool in stress research. Virtual reality, a technology that immerses users in interactive 3D environments, can be used to either induce or alleviate stress, depending on the nature of the virtual experience [13,14]. The VR Trier Social Stress Test (VR-TSST), for example, has been shown to reliably induce stress in a controlled setting, eliciting physiological responses similar to those triggered by real-world stressors [15].

Developing stress detection systems that generalize well across diverse contexts is essential for robustness and enhances computational efficiency by minimizing the necessity for repetitive model training in various scenarios. A vast amount of literature has focused on stress detection and management using ML techniques, which require data for model training. However, to the best of our knowledge, much less attention has been given to studying the performance of ML stress detection models trained on data from one stress-inducing scenario and tested on another [16]. We argue that the ultimate goal of developing ML-based stress detection and management systems is to obtain models that can generalize well across different contexts.

In this research, we focus on this challenge. Specifically, we investigate how well ML models generalize when they are trained on EEG data acquired under mental arithmetic evaluation (MAE) stress and tested on EEG data acquired under VR gaming-induced stress, and vice versa. We present a pipeline for data acquisition, and we make a part of this dataset publicly available. Then, we discuss the performance of several classical ML models on this challenge, and we evaluate their generalization capacity. Furthermore, we explore the effect of various wavelet denoising functions and electrode configurations on stress detection performance. Finally, we give key insights into the relationship between model performance variations and the type of training data, highlighting the data efficiency/cost for each scenario under consideration. This research gives a better understanding of the performance of general-purpose stress detection and management systems. The novelty and contributions of this study can be summarized as follows:

This research examines the generalization capability of ML models for stress detection across various stress-inducing contexts (MAE vs. VR gaming).
Three classical ML models—linear discriminant analysis (LDA), support vector machine (SVM), and K-nearest neighbors (KNN)—are systematically assessed for their generalization capabilities across stress scenarios, providing a comparative performance analysis.
A comprehensive protocol for obtaining EEG data utilizing a MUSE S™ wearable device is provided, with a portion of the dataset accessible to facilitate additional research in the domain.
Various wavelet denoising functions are examined to improve signal quality, illustrating their influence on model performance and determining the most effective configurations for stress detection.
The research examines the impact of individual EEG electrodes and their combinations on classification accuracy, emphasizing optimal electrode configurations for various stress-inducing situations.

The remainder of this paper is organized as follows. First, a review of the recent literature on stress detection and management systems is provided, highlighting key advancements in ML- and VR-based stress detection. Next, the methodology used in this research is described, including the data acquisition process, feature extraction techniques, and the development of machine learning models. The experimental results are then presented, where the performance of the models across different stress-inducing scenarios is evaluated. Finally, the paper concludes with a discussion of the findings and offers potential directions for future research.

2. Related Work

This section provides an overview of stress measurement methods across real-world, computer-based, and immersive VR settings. It compares traditional approaches with AI and ML-based systems, highlighting their evolution and impact on stress detection. The discussion emphasizes how emerging technologies improve measurement accuracy and enable real-time stress monitoring in different environments.

2.1. Stress Detection in Real-World Scenarios

Numerous studies have employed various methodologies to assess stress in real-world environments. These studies often rely on physiological signals, such as HR, HRV, EMG, ECG, EEG, and SC signals, to monitor stress responses. For instance, Healey and Picard [17] conducted a foundational study assessing drivers’ stress levels using these physiological markers in real-world driving scenarios. Their findings revealed that HR and SC signals were the ones most strongly correlated with subjectively perceived stress levels. Similarly, Saeed et al. [18] examined long-term stress by analyzing EEG signals from 33 individuals, emphasizing the importance of key frequency-domain features for stress classification. Another study by Asif et al. [19] explored the influence of music on stress levels in 27 subjects, utilizing the Fast Fourier Transform (FFT) and several classifiers, such as logistic regression, sequential minimal optimization (SMO), and the multilayer perceptron (MLP), thus highlighting the modulatory effects of music on stress.

Further research by Rajendran et al. [20] focused on stress detection among college students by analyzing EEG signals before and after exams, offering insights into temporal stress patterns. Halim and Rehan [21] employed a machine learning framework to identify driving-induced stress patterns in EEG data, with SVM proving particularly effective in accurately classifying stress responses. Jebelli et al. [22] took a different approach by studying construction workers and used EEG data to achieve an accuracy of 80.32% in detecting stress, further supporting the reliability of these methods in real-world applications.

2.2. Stress Detection in Controlled-Setting Scenarios

Complementing real-world studies, research has also explored computer-based tasks as a controlled method to induce and measure stress. Kosch et al. [23] investigated pupil dilation (PD) during mathematical calculation tasks, using PD as a key indicator of mental stress. They found a positive correlation between task complexity and PD, providing a new perspective on cognitive load measurement. Hou et al. [24] extended this by studying stress responses in nine subjects using the Stroop color-word test. Their analysis, using SVM and KNN classifiers, demonstrated the effectiveness of machine learning models in detecting stress from EEG signals.

Gaurav et al. [25] focused on mental stress detection in 41 participants, employing both SVM and MLP classifiers based on different focusing tasks, illustrating the potential of combining multiple classifiers for enhanced accuracy. Similarly, Lahane et al. [26] utilized the DEAP dataset for feature extraction and applied neural networks, KNN, and classification tree classifiers to enhance stress detection capabilities. Gupta et al. [27] explored mental stress detection using EEG signals from 14 subjects and combined five different methods, demonstrating that multimodal data analysis can significantly improve stress detection accuracy. Moreover, convolutional neural networks (CNNs) have also been employed in stress research. Mane et al. [28] used azimuthal projection images from EEG signals to achieve a high accuracy of 93% in stress classification. These studies underline the effectiveness of computer-based tasks and machine learning models in stress detection, offering controlled settings to assess how cognitive load impacts physiological responses.

2.3. VR-Based Stress Detection and Management

Recent advancements have leveraged fully immersive VR environments to measure and manage stress, demonstrating the versatility of VR as both a stress inducer and a therapeutic tool. For example, Cho et al. [29] used VR to induce stress in various scenarios and employed an extreme learning machine to classify stress levels, achieving a remarkable 95% success rate. In another study, Peterson et al. [30] measured stress responses during a balancing task in both real and virtual environments, finding that VR exposure significantly increased physiological stress markers such as EEG, electrodermal activity (EDA), HRV, and HR. The study revealed that height is a significant stressor even in virtual environments, providing valuable insights for the design of VR stressors.

Along similar lines, the literature reports several studies centered on the development of systems that combine biometric sensors with virtual reality (VR). In these systems, biometric sensors are used to collect personalized physiological data, while VR serves as a medium for delivering visual content designed to either induce stress or support stress management. Within this framework, AI methods are integrated to leverage the potential of machine learning for accurate stress detection. Mateos-García et al. [31] developed a system combining AI with a wristwatch-integrated photoplethysmography (PPG) sensor and VR for real-time stress classification in driving scenarios, illustrating the potential of wearable technologies in stress monitoring. El-Habrouk [32] explored mental state recognition using EEG and VR, achieving a high accuracy of 92% during arithmetic tasks within a virtual environment. Additionally, Ahmad et al. [33] introduced a multimodal deep fusion model that used ECG data for VR stress assessment, improving classification accuracy by integrating multiple physiological signals. Moreover, VR-based Internet of Things (IoT) systems have been employed for stress measurement and relief, showcasing the effectiveness of VR in reducing stress levels. Kim et al. [34] utilized VR interviews and deep learning to effectively classify stress, further demonstrating the powerful combination of VR and biometric data for real-time stress detection.

These studies collectively highlight the diverse and evolving approaches to EEG-based stress detection, underscoring the role of VR, AI, and biometric sensors in improving accuracy and offering innovative solutions for real-time stress monitoring and management. While much research has been conducted on stress measurement, most studies focus on either a single stressor or multiple stressors examined independently, leaving a gap in understanding how machine learning models generalize across different contexts. EEG has emerged as a critical tool for collecting stress-related physiological data, providing insights into the body’s responses to stress. Machine learning classifiers, such as SVM, KNN, and LDA, have demonstrated strong performance in within-context detection and classification of stress levels from EEG data. However, the challenge of cross-context detection—training models on one stressor and applying them to another—remains underexplored, emphasizing the need for more research on generalization across varying stress-inducing scenarios.

3. Materials and Methods

This section details the methodology. It is divided into two fundamental phases: the acquisition of EEG signals and the subsequent analytical processing of these signals. Each phase is discussed in depth, outlining the systematic approach taken, the design decisions made, and the equipment used. The workflow delineated in this study is illustrated in Figure 1. The process starts with subject recruitment, wherein participants are admitted and set up for the study, likely encompassing ethical protocols and baseline physiological evaluations. Participants are subsequently subjected to two separate stress-induction scenarios: Scenario 1 employs an MAE stressor to obtain training data, a validated cognitive task requiring timed calculations under stress, whereas Scenario 2 applies a VR-based stressor to obtain training data, immersing participants in virtual environments intended to induce stress.

EEG signal acquisition takes place during both stress-induction phases (VR and MAE) and relaxation intervals to record the neural response to stress and relaxation. A two-hour interval is included between stress exposures to reduce carryover effects and restore physiological baselines, thereby improving data reliability. After data collection, signal analysis procedures are implemented on EEG signals, encompassing EEG denoising, segmentation, and feature extraction to derive pertinent features for input into ML classifiers. These classifiers are trained on annotated EEG datasets to differentiate between stress states and relaxation. These models are then assessed on novel data to determine their generalizability across various scenarios and participants.

3.1. EEG Dataset Description

3.1.1. Subject Recruitment

The study recruited five undergraduate male engineering students, aged 19 to 22, from the Arab Academy for Science, Technology, and Maritime Transport (AASTMT). These participants were selected to provide EEG signals following exposure to different stressors. Subjects were chosen according to the following inclusion criteria: (1) they exhibited no neurological or psychological illnesses and (2) they had normal vision or vision corrected with contact lenses. In this research, we used a small sample of five participants to obtain early observations and establish proof of concept for detecting stress from physiological signals. We selected a small sample size in accordance with the previous literature, which often used small sample sizes (4 to 8 participants) in physiological signal studies to detect stress/emotion [35,36,37,38]. Such studies [35,36,37,38] validated the use of small sample sizes for early research, especially when the aim is to validate a method or assess proof of concept. Along similar lines, this study was designed to give early insight into the use of our approach and thus was consistent with proof-of-concept research. Nevertheless, the findings cannot be generalized to larger populations due to the small size of the participant pool and the lack of diversity in gender, age, and background. In this study, participants were informed about the study’s objectives, and it was ensured that none had health conditions, particularly stress-related disorders. Informed consent was obtained, allowing their biological signals to be used for research. The data acquisition sessions were conducted at AASTMT, prior to which approval for the study design was obtained from the Ethics Committee of the Department of Electronics and Communications Engineering at AASTMT.

3.1.2. Relaxation Protocol

Following recruitment, the participants underwent a relaxation protocol as the initial stage of data collection. The protocol consisted of a four-minute period during which participants listened to a standardized piece of relaxing piano music. Their EEG signals were recorded while they listened to the music. To ensure mental dissociation from the relaxation phase and the next data collection phase, a two-hour break was provided after the relaxation session. The stress-induction procedures were separated by a two-hour interval to reduce carryover effects and ensure that participants returned to a physiological and psychological baseline before being exposed to the second stressor. This approach is consistent with the existing literature examining stress-associated effects measured using EEG. For example, the authors of reference [39] noted that an EEG signal can reflect some aspect of a participant’s residual cognitive state and/or residual emotional state if they are not given adequate time to recover before commencing the next recording session. Likewise, the authors of reference [40] noted that neural activity changes, as an outcome of stress, could increase heart rate and blood pressure. Additionally, stress can alter glucose levels. These physiological factors typically require time to stabilize and return to baseline levels, meaning that sufficient relaxation time is also needed to confidently collect accurate data for this research. By providing a two-hour break, we suggest that the EEG data represent separate responses to each stressor rather than carryover effects from the previous task. Overall, this approach enhances the validity of cross-context generalization in stress detection models.

3.1.3. VR-Based Stress Induction

After completing the relaxation phase, the participants were introduced to the first stressor using VR. This was achieved by equipping them with Oculus Quest™ VR headsets and engaging them in a two-minute session of the rhythm-based game Beat Saber™. In this game, players use virtual lightsabers to slash incoming blocks in sync with musical beats, requiring quick reflexes and precise timing, which effectively induces stress. The decision to use Oculus Quest™ was based on its ergonomic design, standalone functionality, lightweight nature, cost-effectiveness, and strong community support.

Immediately after the VR-based stress session, participants were instructed to put on MUSE S™ EEG headbands to capture their post-stress EEG signals. The MUSE S™ headband was chosen for its ease of use, cost-effectiveness, and suitability for mobile EEG signal acquisition. It contains five electrodes: two frontal electrodes (AF7 and AF8), two temporal electrodes (TP9 and TP10), and a reference electrode (Fpz) on the forehead. A ground electrode is also placed on the forehead, as shown in Figure 2. In some instances, up to 90 s was required for the EEG headgear to achieve acceptable readings, defined as all electrodes detecting EEG signals with the minimum recognizable power. The EEG signals were captured using the Mind Monitor™ app (Version 2.3.1), then exported to a computer for further analysis. The user interface of the Mind Monitor app provides signal strength indicators for each electrode, making it easier to track and manage the acquisition quality. A second two-hour break was provided to ensure sufficient mental dissociation from the VR stressor and prior EEG data collection. Prior to applying the EEG headbands to participants, both the electrodes in the bands and the participants’ scalp regions were cleaned using wipes to minimize impedance resulting from sweat. Furthermore, to prevent hair from influencing the EEG data acquisition, participants with thick or long hair were not allowed to participate in the experiment. Additionally, sweating and other environmental conditions that could interfere with EEG signal acquisition were avoided by keeping the room temperature controlled and ensuring that subjects were sitting comfortably during recordings. These measures conform to recognized best practices in EEG research.

3.1.4. Mental Arithmetic Evaluation Task and EEG Acquisition

Prior to beginning the MAE task, participants completed a questionnaire to assess their baseline stress levels. This step was essential to ensure that any stress detected was not automatically attributed to the task itself, thus minimizing potential bias in the findings. Following the questionnaire, participants were presented with a series of arithmetic problems incorporating a mix of addition, subtraction, multiplication, and division. Each participant completed a two-minute session, matching the duration of the prior VR session, to solve approximately ten arithmetic problems. This setup was designed to mirror the VR session’s intensity and duration for consistency across the experimental conditions.

During this task, participants were instructed to focus solely on the laptop screen displaying the arithmetic problems and to refrain from any conversation. EEG data acquisition occurred simultaneously, capturing the participants’ physiological responses to the cognitive load imposed by the arithmetic tasks. The stages and durations of the experimental protocol, including the relaxation and stress-induction phases, are illustrated in Figure 3.

3.2. EEG Signal Analysis

EEG signal analysis was conducted within MATLAB™, utilizing its comprehensive resources and libraries, particularly the EEGLAB toolbox, which was pivotal for preprocessing, processing, and managing EEG signals. This section describes the signal preprocessing and processing steps of the proposed framework. These steps start with EEG signal preprocessing, which includes denoising and windowing to remove noise and segment the EEG signals. Next comes the feature extraction step, which obtains meaningful features from each EEG segment. Finally, the classification step involves three ML classifiers to detect stress under diverse stress scenarios.

3.2.1. EEG Signal Preprocessing

The preprocessing involved a multistage pipeline, beginning with bandpass filtering within a 1–30 Hz frequency range, followed by signal decomposition through Independent Component Analysis (ICA) to effectively isolate and eliminate noise components. This process utilized the ICLabel function in EEGLAB. Data re-referencing was consistently applied after each preprocessing stage to maintain data integrity and accuracy. The 1–30 Hz passband was chosen to capture the EEG frequency bands of interest: the delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), and beta (13–30 Hz) bands, which have been implicated in aspects of stress-related brain activity more broadly [42,43]. The amplitude and frequency of these waves vary with distinct cognitive states. Consequently, they are instrumental in discerning an individual’s mental state, including stress [44]. Moreover, the 1–30 Hz range effectively eliminates high-frequency noise (such as muscle artifacts exceeding 30 Hz) and low-frequency drifts (such as movement artifacts below 1 Hz), thereby ensuring cleaner and more interpretable EEG signals. This methodology is consistent with recognized protocols in EEG-based stress detection research [42,43,44].
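To make the effect of the 1–30 Hz passband concrete, the following is a minimal sketch of a bandpass step in plain Python. It is not the EEGLAB filter used in this study: it is an idealized brick-wall filter that zeroes DFT bins outside the passband, and the function names (`dft`, `idft_real`, `bandpass_fft`) and the synthetic test signal are assumptions introduced only for illustration. The 300 Hz sampling rate follows the value reported in the preprocessing description.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short demo signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    """Inverse DFT, keeping only the real part of each reconstructed sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def bandpass_fft(x, fs, low_hz=1.0, high_hz=30.0):
    """Idealized bandpass: zero every DFT bin whose frequency lies outside [low_hz, high_hz]."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies, so fold them back.
        freq = (k if k <= n // 2 else n - k) * fs / n
        if freq < low_hz or freq > high_hz:
            X[k] = 0
    return idft_real(X)


# Synthetic one-second segment: a 10 Hz alpha-band tone, 60 Hz interference, and a DC drift.
fs = 300
times = [i / fs for i in range(fs)]
alpha = [math.sin(2 * math.pi * 10 * t) for t in times]
raw = [a + 0.5 * math.sin(2 * math.pi * 60 * t) + 2.0 for a, t in zip(alpha, times)]

filtered = bandpass_fft(raw, fs)
max_err = max(abs(f - a) for f, a in zip(filtered, alpha))
print(f"largest deviation from the clean 10 Hz tone: {max_err:.2e}")
```

In this toy case the 60 Hz component and the constant offset fall outside the passband and are removed, while the 10 Hz component is retained. A practical pipeline would normally use a properly designed FIR or IIR filter rather than a hard spectral mask, which can introduce ringing on real recordings.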

Afterwards, EEG signals were denoised using the Multiscale Wavelet Transform (MSWT) with different wavelet families. Noise removal was initially performed using the default wavelet denoising function, Symlets 4 (sym4). Additional wavelets were then explored to assess their impact on accuracy across the two scenarios under consideration. Specifically, the Symlets (sym), Daubechies (db), and Coiflets (coif) families were employed due to their proven noise-reduction capabilities in EEG studies, particularly in mitigating power-line interference and other artifacts while preserving signal integrity [45,46,47,48]; the specific wavelets were sym1, sym2, sym3, sym4, db1, db2, db3, db4, and coif3. Eventually, coif3 was chosen for its demonstrated effectiveness in EEG signal denoising, particularly for power-line noise (PLN) and electromyogram (EMG) noise. For denoising, the MSWT divided the EEG signals into 100-sample windows with a 50-sample overlap, using a sampling frequency of 300 Hz. The denoised EEG data were then segmented into overlapping frames, using an incremental binning strategy that created segments overlapping by 50 milliseconds, allowing for detailed signal coverage.
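The windowing and denoising steps can be illustrated with a short, self-contained Python sketch. It splits a signal into 100-sample windows with a 50-sample overlap, as stated above, and applies soft-threshold wavelet denoising with the Haar wavelet (db1, one of the wavelets explored). Everything else is an assumption made for illustration: the two-level decomposition, the threshold value of 0.2, the helper names, and the synthetic signal. It is not the MATLAB routine used in this study.

```python
import math

SQRT2 = math.sqrt(2.0)


def sliding_windows(x, size=100, overlap=50):
    """Split x into fixed-size windows; consecutive windows share `overlap` samples."""
    step = size - overlap
    return [x[s:s + size] for s in range(0, len(x) - size + 1, step)]


def haar_forward(x):
    """One level of the Haar (db1) transform: approximation and detail coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level, interleaving the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return out


def soft_threshold(value, thr):
    """Shrink a coefficient toward zero by thr; small coefficients become zero."""
    return math.copysign(max(abs(value) - thr, 0.0), value)


def haar_denoise(x, thr, levels=2):
    """Decompose, soft-threshold the detail coefficients, and reconstruct."""
    if len(x) % (2 ** levels):
        raise ValueError("window length must be divisible by 2**levels")
    approx, details = list(x), []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append([soft_threshold(v, thr) for v in detail])
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx


# Synthetic one-second signal at 300 Hz: slow 5 Hz wave plus small alternating noise.
signal = [math.sin(2 * math.pi * 5 * i / 300) + (0.05 if i % 2 else -0.05)
          for i in range(300)]
windows = sliding_windows(signal)                  # 100-sample windows, 50-sample overlap
denoised = [haar_denoise(w, thr=0.2) for w in windows]
print(len(windows), "windows of", len(windows[0]), "samples")
```

With a threshold of zero the transform reconstructs the input exactly, which is a quick sanity check on the forward and inverse steps. Swapping in sym4, db2, or coif3 would require longer filters and boundary handling, for which a dedicated wavelet library is the more practical choice.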

3.2.2. Feature Extraction

A structured methodology for feature extraction was developed to analyze mental states through EEG signals, leveraging advanced signal processing techniques in MATLAB™. Prior research has shown that for enhanced accuracy in detecting emotional activities such as stress, it is advantageous to obtain attributes from the frequency domain [49]. Other research has indicated that the integration of time and frequency variables significantly enhances stress detection rates [50]. Consequently, this paper proposes the utilization of time and frequency attributes. Among frequency features, median frequency (MDF) analysis was subsequently performed on each segment’s power spectrum, characterizing EEG signals in the frequency domain. Periodogram methods were used to estimate the power spectrum, and the median frequency served as a central indicator within the frequency distribution. To enhance this analysis, the median frequency mean deviation (MFMD) was calculated, quantifying variability around the median frequency. Fourier transform techniques were applied to offer in-depth insights into the frequency characteristics of the EEG signals, while time-domain power spectral moments provided additional signal characterization by analyzing the power distribution across different frequencies within the EEG signals. The entire preprocessing and the subsequent processing steps are illustrated in Figure 4, depicting the techniques used in signal preprocessing and feature extraction. After extracting features, they were compiled into a feature matrix for each EEG channel, showcasing the complexity of EEG signal analysis and revealing subtle mental states.

The selected features are physiologically relevant for stress detection. For example, the median frequency (MDF) captures the central tendency of EEG spectral power and reflects shifts toward higher frequencies under cognitive load. The MFMD quantifies variability around this frequency, which may indicate fluctuations in stress responses. Spectral moments (zero-, second-, and fourth-order) capture signal power, dispersion, and peakedness, respectively—metrics known to vary during stress-related neural activation. These features together offer a robust representation of both frequency-domain characteristics and spectral dynamics associated with stress. The Equations (1)–(5) employed to calculate these features are shown below.

\mathrm{MDF} = \frac{1}{2} \sum_{j = 1}^{M} \mathrm{PSD}_{j}

(1)

\mathrm{MFMD} = \frac{1}{2} \sum_{k = 1}^{M} A_{k}

(2)

where M denotes the length of the power spectral density, PSD_j represents the jth entry of the power spectral density, and A_k represents the EEG magnitude spectrum at frequency index k. A sampled form of the EEG segment is represented as x_i, where i = 1, 2, …, N, with a length of N. In addition, \bar{m}_{0}, \bar{m}_{2}, and \bar{m}_{4} represent the root squared zero-, second-, and fourth-order moments. The discrete Fourier transform of a segmented EEG signal is denoted X_k, a function of the frequency index k. P_k denotes the phase-excluded power spectrum, which corresponds to the product of X_k and its conjugate, X_k*, divided by N.

\bar{m}_{0} = \sqrt{\sum_{i = 1}^{N} x_{i}^{2}}

(3)

\begin{array}{l} \bar{m}_{2} = \sqrt{\sum_{k = 0}^{N - 1} k^{2} P_{k}}, \\ P_{k} = \frac{1}{N} X_{k} X_{k}^{*} \end{array}

(4)

\begin{array}{l} \bar{m}_{4} = \sqrt{\sum_{k = 0}^{N - 1} k^{4} P_{k}}, \\ P_{k} = \frac{1}{N} X_{k} X_{k}^{*} \end{array}

(5)
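As a rough illustration of Equations (1)–(5), the following Python sketch computes the phase-excluded power spectrum, the three root-squared moments, and a median-frequency index for one segment. It is not the authors' MATLAB implementation: the function names are hypothetical, a naive DFT is used so that the example needs only the standard library, and the median frequency is taken, by assumption, as the first bin at which the cumulative power reaches the half-power value given by Equation (1).

```python
import cmath
import math


def power_spectrum(x):
    """Phase-excluded power spectrum P_k = X_k * conj(X_k) / N (naive DFT)."""
    n = len(x)
    spectrum = []
    for k in range(n):
        xk = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spectrum.append((xk * xk.conjugate()).real / n)
    return spectrum


def spectral_moments(x):
    """Root-squared zero-, second-, and fourth-order moments of a segment."""
    p = power_spectrum(x)
    m0 = math.sqrt(sum(v * v for v in x))                      # time domain
    m2 = math.sqrt(sum(k ** 2 * pk for k, pk in enumerate(p)))
    m4 = math.sqrt(sum(k ** 4 * pk for k, pk in enumerate(p)))
    return m0, m2, m4


def median_frequency_index(psd):
    """First index at which the cumulative power reaches half the total power."""
    if not psd:
        raise ValueError("psd must not be empty")
    half_power = 0.5 * sum(psd)  # the quantity on the right of Equation (1)
    running = 0.0
    for j, value in enumerate(psd):
        running += value
        if running >= half_power:
            return j
    return len(psd) - 1


segment = [1.0, -2.0, 3.0, 0.5]  # toy segment, not EEG data
m0, m2, m4 = spectral_moments(segment)
print(m0, m2, m4, median_frequency_index([1.0, 1.0, 2.0, 4.0]))
```

By Parseval's theorem, the sum of P_k over all k equals the sum of x_i squared, so the zero-order moment can be obtained from either domain; the check on the toy segment relies on this identity.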

3.2.3. Classification

For classification, three machine learning models were utilized: SVM, KNN, and LDA classifiers. These non-parametric models, widely used in brain–computer interface (BCI) applications, have demonstrated strong capabilities for EEG classification, as they do not require assumptions about the underlying data distribution. Each model was configured using default parameters, as specified in the EEG Lab library. The MATLAB® Classification Learner (MATLAB 2023a) was used to ensure the consistent application of default settings across all model instances. The quadratic kernel was used for SVM, so it is referred to as SVM Q throughout the manuscript. For KNN, the Euclidean distance metric was used with a medium number of neighbors (10), following the preset naming in MATLAB®; it is therefore referred to as KNN M throughout the paper. For LDA, the default parameters were used.

An 80–20 train–test ratio was maintained throughout the experimental design, as shown in Figure 1. In Scenario 1, the full set of EEG data samples from the MAE-based stressor was used for training, while approximately 25% of the VR-induced stress data was designated for testing. This procedure was conducted individually for each participant, meaning each model was trained and subsequently tested on data from the same participant across different stressors. This individualized approach resulted in a comprehensive aggregation of performance metrics, which will be detailed in the following section.
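One way to read the 80–20 ratio alongside the 25% figure is sketched below. It assumes, purely for illustration, that a participant's MAE and VR recordings contain the same number of samples; the text does not state this, and the sample counts are hypothetical.

```python
def split_shares(n_train, n_test):
    """Return the train and test shares of the combined sample count."""
    total = n_train + n_test
    return n_train / total, n_test / total


n_mae = 1000               # hypothetical MAE samples, all used for training
n_vr = 1000                # hypothetical VR samples, assumed equal to n_mae
n_test = int(0.25 * n_vr)  # about 25% of the VR data held out for testing

train_share, test_share = split_shares(n_mae, n_test)
print(train_share, test_share)
```

Under that assumption, 1000 training samples and 250 test samples give shares of 0.8 and 0.2, which matches the stated 80–20 ratio.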

4. Experimental Results and Discussion

This section presents an in-depth evaluation of the experimental setups and performance measures and a detailed analysis of participant data, focusing on the effectiveness of ML classifiers in detecting stress across varying stress inducers from EEG data.

4.1. Experimental Setup

Each participant’s EEG data were thoroughly analyzed using three classifiers to systematically assess their effectiveness across various subjects. The evaluation was structured into two distinct experimental setups, each simulating different stress-inducing environments to test classifier efficacy. In the first setup, Scenario 1, the EEG data from the MAE served as the training dataset, while EEG data from the VR-based stressor were used for testing. Scenario 2 flipped the stressors. This strategy for setting up the experiment aimed to determine whether the classifiers could reliably detect stress across different contexts, offering insights into their applicability in diverse stress-inducing situations. In each scenario, the setup involved recording EEG signals from various electrodes in the adopted EEG headset, as shown in Figure 5. It is worth mentioning that this setup is for one participant, and the same setup was used with all other participants. Please see Section 3.1.3 for more information on the headset’s electrodes that were utilized in the experiment.

Simulations were conducted on a Windows-based system with an Intel Core i5 processor and 8 GB of RAM, utilizing MATLAB® for scientific computing. MATLAB®’s robust tools and libraries facilitated the efficient and accurate execution of simulations, contributing valuable insights to the field of classification in EEG signal processing.

4.2. Performance Metrics

A set of performance metrics commonly used in image classification research was employed to assess classifier performance, including accuracy, precision, F1-score, and recall. These metrics are calculated using the number of false positives (FPs), true positives (TPs), true negatives (TNs), and false negatives (FNs), as shown in Equations (6)–(9). These metrics provided a comprehensive evaluation of accuracy and reliability for each classifier, aligning with standard practices in similar studies to ensure relevance and comparability with existing research in the field.

\mathrm{Accuracy} = \frac{TP + TN}{TN + FP + FN + TP}

(6)

\mathrm{Recall} = \frac{TP}{TP + FN}

(7)

\mathrm{Precision} = \frac{TP}{TP + FP}

(8)

\text{F1-Score} = \frac{2 \times TP}{(2 \times TP) + FP + FN}

(9)
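Equations (6)–(9) translate directly into code. The sketch below is a minimal Python illustration with hypothetical confusion-matrix counts; the function name and the counts are assumptions introduced here, not values from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1-score from confusion-matrix counts.

    The ratios are undefined when a denominator is zero (for example, no
    positive predictions); this sketch does not handle that case.
    """
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1_score = (2 * tp) / ((2 * tp) + fp + fn)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1_score": f1_score,
    }


# Hypothetical counts for one participant, electrode, and classifier.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(metrics)
```

The F1-score of Equation (9) equals the harmonic mean of precision and recall. This is why a classifier with high precision but low recall, a pattern that recurs in the results, ends up with an F1-score between the two values and closer to the lower one.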

4.3. Electrode-Specific Performance Analysis Across Diverse Stress Scenarios

In this section, data analysis was conducted on an electrode-by-electrode basis, with four electrodes evaluated per participant: AF7, AF8, TP9, and TP10, as shown in Figure 4. Data from each electrode were independently analyzed rather than aggregated, allowing an in-depth analysis specific to each electrode and participant. This approach maintained a high level of detail, ensuring that each electrode’s unique contribution was understood.

4.3.1. Scenario 1: MAE Data as Training Data and VR Data as Testing Data

Table 1 presents the performance evaluation of the AF7 electrode across five participants utilizing three classifiers (LDA, SVM, and KNN), highlighting notable trends in accuracy, precision, recall, and F1-score. The LDA classifier exhibited an average accuracy of 69.23%, accompanied by considerable variability among participants. Participant 5 attained the highest individual accuracy at 99.29%, whereas Participant 3 recorded the lowest at 47.86%. Notwithstanding its diminished accuracy, LDA demonstrated robust precision, averaging 91.76%, with maxima of 98.65% (Participant 2) and 98.57% (Participant 5). Nonetheless, recall values were significantly lower, averaging 68.09%, with Participant 3 (48.91%) and Participant 2 (52.52%) exhibiting subpar performance. The mean F1-score for LDA was 76.49%, indicating a balance between precision and recall; however, Participant 3’s low score of 64.73% emphasized inconsistencies.

On the other hand, the SVM classifier demonstrated enhanced performance compared to LDA, attaining an average accuracy of 76.73%. Participant 4 achieved an accuracy of 90.41%, while Participant 2 recorded an accuracy of 58.11%. Precision remained strong, averaging 94.22%, with Participant 4 (95.89%) and Participant 3 (94.29%) enhancing this performance. Recall enhanced to an average of 73.45%, driven by Participant 4 (86.42%) and Participant 5 (100%), while Participant 2 (54.55%) persisted in presenting a deficiency. The mean F1-score increased to 81.53%, with Participant 4 (90.91%) and Participant 5 (98.55%) demonstrating SVM’s proficiency in balancing precision and recall. The KNN classifier surpassed its competitors, attaining the highest average accuracy of 78.96% and outperforming in all metrics. Participant 4 attained 100% accuracy, the highest individual score in the study, whereas Participant 2 recorded the lowest score at 47.30%. The precision was outstanding, averaging 95.36%, with Participants 3 and 4 achieving 97.14% and 100%, respectively. Recall also enhanced, averaging 76.84%, with Participants 4 and 5 achieving 100%, whereas Participant 2 (48.61%) encountered difficulties. The F1-score reached an average of 83.90%, enhanced by Participant 4’s flawless 100% and Participant 3’s 82.42%, highlighting KNN’s exceptional equilibrium between metrics.

These results reveal considerable variability among participants. Participant 5 consistently demonstrated superior performance across classifiers, attaining near-perfect accuracy (99.29% with LDA and KNN) and recall (100% with LDA and SVM). Participant 4 distinguished themselves, notably achieving 100% accuracy with KNN and 90.41% accuracy with SVM, indicating a robust alignment between their data and these models. In contrast, Participants 2 and 3 encountered difficulties, as Participant 2’s accuracy decreased to 47.30% (KNN) and Participant 3’s accuracy fell to 47.86% (LDA), suggesting possible complications such as data variability or physiological disparities.

The confidence intervals and ANOVA findings in Table 2 provide essential details regarding the comparative performance of our classifiers for the AF7 electrode. The relatively wide confidence intervals for accuracy (LDA: [50.12, 88.34], KNN: [55.38, 102.54]) suggest that accuracy varies considerably across participants and may not generalize to a wider population. However, precision exhibited narrower confidence intervals (LDA: [82.54, 100.98]), meaning this metric was more consistent across individuals. The ANOVA showed statistically significant differences among classifiers for accuracy (p = 0.021), recall (p = 0.018), and F1-score (p = 0.034), supporting the observation that KNN’s performance is stronger than that of the other two classifiers on these three metrics. The precision metrics, however, displayed non-significant differences (p = 0.104), suggesting that all three models perform similarly in classifying positive cases accurately. Although KNN exhibits the best overall performance on each of the metrics assessed, the variance indicates either inconsistent performance or difficulty in generalizing these preliminary results.
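The reported intervals are symmetric about the reported means (for example, the midpoint of [55.38, 102.54] is 78.96), which is what a mean plus or minus a margin produces. The sketch below shows one common construction, a Student's t interval over five participants. Whether the authors used exactly this construction is an assumption, and the accuracy values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom,
# i.e. for a sample of five participants.
T_CRIT_95_DF4 = 2.776


def mean_ci_95(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, lower, upper) of a t-based 95% CI; default t_crit needs n = 5."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    margin = t_crit * sem
    return mean, mean - margin, mean + margin


# Hypothetical per-participant accuracies (percent), not taken from Table 1.
accuracies = [60.0, 50.0, 80.0, 100.0, 99.0]
mean, lower, upper = mean_ci_95(accuracies)
print(round(mean, 2), round(lower, 2), round(upper, 2))
```

Because the margin is added symmetrically to the mean, the upper bound can exceed 100% when the spread across participants is large, as it does for several intervals quoted above. Such bounds describe uncertainty about the mean and are not attainable accuracy values.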

Similarly, Table 3 presents detailed performance metrics specific to the AF8 electrode data for Scenario 1, following the same format as that used for the AF7 sensor. This table includes metrics for each participant, demonstrating the performance of the LDA, SVM, and KNN classifiers when applied to AF8 data. The SVM classifier for the AF8 electrode (Table 3) attained the greatest average accuracy of 87.51%, surpassing LDA (78.88%) and KNN (86.60%). Participant 1 exhibited outstanding performance with SVM (97.30% accuracy), and Participant 5 also displayed commendable results (94.29% accuracy). The SVM classifier exhibited high average precision (97.16%) and average recall (83.60%), yielding a strong average F1-score of 89.13%. This consistency highlights the reliability of SVM for AF8 data. These results are higher than those achieved using AF7 (Table 1).

Incorporating the confidence intervals and ANOVA test results offered a more robust statistical assessment of the performance metrics, as presented in Table 4. The 95% confidence intervals indicate participant variability in classifier performance, with a smaller range (e.g., precision for SVM: 93.76–100.56) indicating more consistent performance and a larger range (e.g., accuracy for LDA: 64.50–93.26) indicating less consistent performance. The ANOVA confirms that there were statistically significant differences in performance across the classifiers for all metrics evaluated (p < 0.05), indicating that the differences observed were not due to chance. It is important to note that SVM performed consistently well in terms of average accuracy (87.51%) and F1-score (89.13%) and had minimal range variability, which reinforces the reliability of the classifier for detecting stress using the AF8 electrode. These findings lessen the concern of overfitting, since they suggest that the high accuracy claims noted in the results are supported by statistically significant trends and variability at the participant level.
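For readers unfamiliar with the test, the F statistic behind a one-way ANOVA can be sketched as follows. The text does not say which ANOVA variant was used, so the one-way form is an assumption, and the groups below are toy numbers rather than study data. Converting F to a p-value requires the F distribution, which is left out so that the example needs only the standard library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy groups standing in for per-participant scores of three classifiers.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

With three classifiers and five participants each, this construction would give 2 and 12 degrees of freedom. A large F means the classifier means differ by more than the participant-level spread within each classifier would explain.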

Table 5 provides a comprehensive overview of the LDA, SVM, and KNN classifiers’ performance metrics, specifically for the TP9 sensor. For the TP9 electrode (Table 5), the LDA classifier achieved the highest average performance with an average accuracy of 69.76%, although the overall metrics were inferior relative to the other electrodes. Participant 1 attained an accuracy of 87.16% using LDA, while performance exhibited significant variability, with Participant 3 and Participant 4 achieving accuracy scores of 50.00% and 69.18%, respectively. LDA’s elevated average precision (89.42%) set against its diminished average recall (69.39%) indicates possible difficulties in identifying true positives. The average F1-score for LDA (76.06%) exceeded that of SVM (75.00%) and KNN (74.81%), confirming its comparative superiority for TP9 despite overall modest performance.

By including CIs and ANOVA test statistics, the performance metrics for the TP9 electrode were evaluated more rigorously, as shown in Table 6. The CIs indicate that accuracy, precision, recall, and F1-score varied considerably across participants, especially for the LDA and KNN classifiers. For instance, the 95% CI for LDA accuracy spanned approximately 39 percentage points (50.23–89.29%), indicating considerable participant-level variability. ANOVA results also indicated that recall was the only metric for which statistically significant differences were found between classifiers (p = 0.037), suggesting that, while the classifiers performed similarly regarding accuracy, precision, and F1-score, they differed in their ability to identify true positives. Furthermore, the absence of significant differences in accuracy across classifiers (p = 0.081) suggests that high-accuracy claims for any single classifier on TP9 should be treated with caution, as they may reflect overfitting and are unlikely to generalize across all contexts or participants. Overall, the data indicate that both variability and statistical rigor should be considered when interpreting performance metrics associated with classifiers, particularly when working in a small-sample-size context.

Table 7 covers the results of the TP10 electrode, providing a thorough breakdown of performance metrics for individual participants. This table follows the same format as those for the AF7, AF8, and TP9 sensors, ensuring consistency in data presentation. The TP10 electrode (Table 7) demonstrated the most robust results, with the SVM classifier attaining a maximum average accuracy of 95.76%, the highest among all sensors. Participant 3 and Participant 4 exhibited exceptional performances, achieving accuracies of 92.86% and 93.84%, respectively. Furthermore, SVM achieved an exceptional average precision (97.30%), average recall (94.59%), and average F1-score (95.85%), indicating a nearly optimal balance. Although LDA and KNN exhibited lower averages (89.32% and 89.17%, respectively), Participant 4 attained 96.55% accuracy with LDA, demonstrating intermittent excellence but diminished consistency.

The incorporation of CIs and ANOVA test statistics enhanced the robustness of the evaluation of the performance metrics reported for the TP10 electrode. As shown in Table 8, the 95% CIs indicate a likely range for the true population mean and thereby also reflect variability. For example, the SVM classifier demonstrated superior average accuracy (95.76%) relative to the other classifiers with a narrow 95% CI (91.14–100.00%), indicating strong and consistent performance across participants. The ANOVA test revealed statistically significant differences across the three classifiers for accuracy (p = 0.031), recall (p = 0.045), and F1-score (p = 0.022), which suggests that these performance measures are not uniform across classifiers. Together with the higher means and CIs shown for the SVM, this suggests that the SVM classifier performed better than both the LDA and KNN classifiers in accuracy and recall. These results support the proposition that the SVM classifier is effective for stress detection data derived from the TP10 electrode while demonstrating a need to factor in participant-level variability and statistical significance when making high-accuracy claims. The lack of significance for precision (p = 0.102) indicates that all three classifiers performed similarly in identifying positive cases. Differences in recall and F1-score emphasize the need to carefully select a model, particularly when considering the balance of sensitivity and specificity as part of the analysis.

The analysis of the results in Tables 1–8 disclosed multiple significant findings. Participant 4 attained the highest individual accuracies of 96.55% and 93.84% with the LDA and SVM classifiers, respectively, using the TP10 electrode, while Participant 1 achieved 99.32% and 97.30% individual accuracies using LDA and SVM, respectively, employing the AF8 electrode. The results demonstrate remarkable individual performances, especially for electrodes linked to posterior (TP10) and frontal (AF8) brain regions, indicating localized signal effectiveness for stress detection. The TP10 electrode combined with SVM exhibited the highest average accuracy at 95.76%, surpassing all the other electrode–classifier combinations, while AF8’s SVM attained 87.51%. This highlights the TP10 electrode’s enhanced reliability in detecting stress-related patterns among participants, likely attributable to its positioning in posterior regions, which may be essential for identifying physiological stress responses.

Participant variability was significant, exhibiting marked differences in performance. For instance, Participant 3 attained a score of merely 50.00% for both accuracy and recall with LDA, KNN, and SVM on TP9, whereas Participant 1 achieved an accuracy of 87.16%, 85.81%, and 83.78% utilizing the identical electrode with the LDA, SVM, and KNN classifiers, respectively. These discrepancies highlight fundamental variances in individual EEG signal characteristics, potentially affected by anatomical variations, consistency of electrode position, or distinct stress response patterns. SVM exhibited superior efficacy for AF8 and TP10, attributed to its capacity to manage non-linear and intricate EEG data patterns, as indicated by its high precision (up to 97.30%) and recall (up to 94.59%). Conversely, LDA, a linear model, demonstrated moderate performance for TP9 (69.76% average accuracy), indicating either simpler signal dynamics in that area or constraints in capturing subtle stress-related characteristics.

The findings suggest that the TP10 electrode, especially with SVM, is crucial for stress analysis, likely due to the abundance of stress-related neural activity in posterior brain regions. The superior performance of SVM is attributed to its ability to model complex patterns in EEG data, while LDA’s linear assumptions may restrict its effectiveness in more dynamic contexts. The considerable variability among participants highlights the necessity for personalized strategies, such as adaptive feature selection or tailored classifier tuning, to enhance the robustness of EEG-based stress detection systems. This may reduce the effects of biological variability and improve cross-subject generalizability in future applications. To provide a comprehensive overview of our findings, we calculated the average accuracy for the three classifiers, considering results from each participant and electrode. These aggregated results are presented in Figure 6, where a clear trend emerges: the highest accuracy is not achieved by the same ML model at every electrode. For instance, KNN yielded the highest accuracy (78.96%) for AF7, while SVM was the most accurate (87.51%) for AF8. For TP9, the highest average accuracy (69.76%) was achieved using LDA, as reported in Table 5. Notably, the TP10 sensor recorded the highest average accuracy, with SVM achieving a peak accuracy of 95.76% and an overall sensor average of 91.42%. This emphasizes the significance of the TP10 sensor in our study, marking it as a focal point for further analysis.
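The overall TP10 sensor average can be reproduced from the three classifier averages quoted for Scenario 1 (SVM 95.76%, LDA 89.32%, KNN 89.17%). The short Python check below uses only those reported figures; the variable names are introduced here for illustration.

```python
# Average Scenario 1 accuracies (percent) for the TP10 electrode, as quoted
# in the discussion of Table 7.
tp10_scenario1_accuracy = {"LDA": 89.32, "SVM": 95.76, "KNN": 89.17}

# Unweighted mean over the three classifiers.
sensor_average = sum(tp10_scenario1_accuracy.values()) / len(tp10_scenario1_accuracy)

# Classifier with the highest average accuracy at this electrode.
best_model = max(tp10_scenario1_accuracy, key=tp10_scenario1_accuracy.get)

print(round(sensor_average, 2), best_model)
```

The unweighted mean rounds to 91.42%, matching the sensor average stated in the text, and SVM is the best-performing model at TP10.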

In Figure 7, we present a different perspective. Effectively, it moves us closer to the vision, stated above, of having personalized strategies for stress detection. We compared the maximum metric values across all models and electrodes for each participant. Specifically, we analyzed the per-electrode results in Table 1, Table 3, Table 5, and Table 7, and we recorded, for each participant, the model and the electrode that attained the maximum value for each of the four adopted performance metrics. Especially for Participants 1, 4, and 5, it is apparent that there is an electrode that consistently gives higher values for all the adopted metrics than the other electrodes. It is also apparent that, on an individual level, we might be able to find a model that is more indicative than other models, e.g., KNN for Participant 4 and LDA for Participants 1 and 5. We believe that, following this analysis, a stress detection system can be tuned or customized to perform better on an individual level.
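The per-participant selection described here amounts to taking the maximum over electrode and model pairs. The sketch below illustrates this for Participant 4's Scenario 1 accuracy, using only the values quoted in the text. It is therefore an incomplete view of the underlying tables, and the function name is an assumption introduced here.

```python
def best_configuration(scores):
    """Return ((electrode, model), value) for the highest-scoring pair."""
    return max(scores.items(), key=lambda item: item[1])


# Participant 4, Scenario 1 accuracies (percent) as quoted in the text.
# Pairs whose values are not quoted in the text are omitted.
participant4_accuracy = {
    ("AF7", "SVM"): 90.41,
    ("AF7", "KNN"): 100.00,
    ("TP9", "LDA"): 69.18,
    ("TP10", "LDA"): 96.55,
    ("TP10", "SVM"): 93.84,
}

(best_electrode, best_model_p4), best_value = best_configuration(participant4_accuracy)
print(best_electrode, best_model_p4, best_value)
```

For these quoted values the selection returns AF7 with KNN, in line with the observation that KNN is the most indicative model for Participant 4. Repeating the same selection for each metric and participant yields the kind of summary that Figure 7 presents.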

4.3.2. Scenario 2: VR Data as Training Data and MAE Data as Testing Data

In Scenario 2, where classifiers were trained on VR-generated stress data and tested on MAE stress data, the results shown in Table 9 center on the performance of stress detection models employing the AF7 electrode. For the three classifiers (LDA, SVM, and KNN), the table offers key performance measures across five participants. The findings show variation in classifier performance among individuals. Table 9 shows that SVM had the best average accuracy (80.02%), followed by LDA (78.15%) and KNN (75.97%). Using LDA and SVM, Participant 2 achieved the best individual accuracies (93.64% and 91.36%, respectively), indicating strong model performance for this particular participant. With LDA, Participant 3 had the lowest accuracy (59.13%), perhaps reflecting individual variations in stress responses or EEG signal quality.

Across all classifiers, precision values were consistently high; SVM attained the highest average precision (96.24%), closely followed by KNN (96.19%) and LDA (94.03%). These findings show that, when predicting stress, all of the classifiers were quite confident; yet precision by itself cannot account for missed detections. For average recall scores, SVM attained 75.30%, outperforming LDA (74.59%) and KNN (70.38%). The average F1-score was also greatest for SVM (83.49%), followed by LDA (82.05%) and KNN (80.57%). These results confirm that, although all three classifiers showed great accuracy, SVM was the most consistent in stress detection among participants; hence, it is the best-performing model for the AF7 electrode in this context.

Participant 2 consistently achieved higher accuracy across all classifiers, with accuracies of 93.64% (LDA), 91.36% (SVM), and 85.91% (KNN). This suggests that, possibly because of clearer stress-related neural patterns, EEG signals from this individual were more consistently classified. With accuracy values of 59.13% (LDA), 63.04% (SVM), and 64.78% (KNN), Participant 3 displayed the lowest classification performance, indicating either greater variation in stress responses or signal inconsistencies that influenced model predictions.

The addition of CIs and ANOVA test statistics contributes to a more robust assessment of the reported performance metrics of the AF7 electrode from Scenario 2, as displayed in Table 10. The 95% CIs demonstrate the variation in the classifier performance metrics across participants, with narrower intervals indicating a greater level of consistency. For instance, the SVM classifier reported a higher average accuracy (80.02%) and a relatively narrow confidence interval (68.79–91.25) compared to the other two classifiers. Thus, this classifier was the most reliable and consistent on the accuracy metric at the AF7 electrode location. The ANOVA test showed an overall statistically significant difference among the classifiers for accuracy (p = 0.048), recall (p = 0.032), and F1-score (p = 0.041), indicating that these metrics were not uniform across the classifiers. The SVM also had greater accuracy and recall than the LDA and KNN classifiers, as expected from its averages and 95% confidence intervals. These results support the conclusion that the SVM classifier was the most effective for stress detection with the AF7 electrode, while also highlighting the need to consider participant-level variability and statistical significance when discussing high-accuracy claims. The lack of significance in precision (p = 0.087) indicates that all three classifiers performed similarly in correctly classifying positive cases, but differences in recall and F1-score highlight the care needed in selecting models with respect to sensitivity and specificity.

Table 11 displays the classification performance of the AF8 electrode in Scenario 2. The LDA classifier attained the highest average accuracy of 83.16%, followed by SVM at 82.12% and KNN M at 80.54%, suggesting that AF8 is a better choice than AF7 for stress detection in the Scenario 2 context. Upon examining the performance of individual participants, Participant 2 attained the highest accuracy (100%) across all classifiers, indicating remarkable consistency in stress classification using AF8. Likewise, Participant 1 demonstrated consistently high accuracy (97.87%, 97.87%, and 96.81%) across LDA, SVM, and KNN, evidencing the reliability of the AF8 electrode for these subjects. Conversely, Participant 5 demonstrated the lowest classification accuracy (between 56.85% and 58.90%), indicating possible discrepancies in the measurement of their stress responses with AF8. Participant 3 demonstrated moderate performance, achieving accuracy rates ranging from 78.70% to 82.61%, whereas Participant 4 showed marginally inferior accuracy (68.97–76.44%), highlighting variability in classifier efficacy among individuals.

Upon analyzing classifier-specific trends, LDA attained the highest mean performance across various metrics, rendering it the most proficient model for AF8 stress classification in this context. LDA exhibited the optimal equilibrium among average precision (96.31%), average recall (79.82%), and average F1-score (86.40%), emphasizing its efficacy in preserving classification stability. The SVM classifier achieved an average precision of 95.45% and an average F1-score of 85.58%, demonstrating competitive performance. The KNN classifier, although it attained the lowest average accuracy (80.54%), exhibited a high average precision (95.64%, second only to LDA), signifying its proficiency in accurately identifying stress-related instances, albeit with occasional misclassifications due to a lower recall (77.16%). A notable observation is that Participant 2 demonstrated flawless classification performance metrics (100%) across all models, indicating that their EEG stress responses were markedly distinguishable under various stress conditions. Conversely, Participant 5 exhibited the lowest accuracy, with values declining to 56.85% (SVM), indicating that stress detection efficacy is affected by individual physiological differences.

The CIs and ANOVA test statistics were obtained to provide a more thorough assessment of the overall performance metrics for the AF8 electrode in Scenario 2, as presented in Table 12. The 95% confidence intervals demonstrate participant variability in performance. For instance, precision had a narrower range for LDA (93.19–99.43), which indicates greater consistency in correctly identifying positive cases. Conversely, accuracy, recall, and F1-score had wider ranges, signifying considerable variability at the participant level (e.g., some participants, such as Participant 4, had lower accuracy scores). ANOVA test statistics showed that the differences among classifiers in accuracy (p = 0.042), recall (p = 0.038), and F1-score (p = 0.047) were statistically significant, which implies that these metrics were not uniform across classifiers. Notably, LDA had the highest average accuracy (83.16%) and recall (79.82%). ANOVA indicated that LDA’s performance (accuracy and recall) was significantly different from that of SVM and KNN. This supports the use of LDA for AF8-based stress detection in this scenario, although participant variability and statistical significance testing remain important when interpreting high-accuracy claims. The fact that precision did not differ significantly (p = 0.089) suggests a similar level of precision for each classifier when identifying positive cases, and underscores the need for careful model selection that distinguishes between precision, recall, and F1-score, since these metrics weigh false positives and false negatives differently.

Table 13 presents a comprehensive assessment of the classification performance for the TP9 electrode among five participants in Scenario 2. Among the three classifiers, KNN attained the highest average accuracy of 79.89%. A notable discrepancy in accuracy was observed among participants. Participant 2 attained the highest accuracy (97.73%) utilizing KNN, rendering it the optimal individual case in this context. Participant 1 excelled, attaining 91.35% accuracy with LDA, whereas Participant 3 exhibited inferior accuracy across all classifiers, achieving a maximum performance of 87.83% with KNN.

Participant 4 exhibited the lowest overall accuracies of 59.77% (KNN), 47.70% (LDA), and 62.07% (SVM), signifying challenges in generalizing stress patterns for this individual. The precision scores among classifiers were predominantly elevated, especially for KNN, which exhibited an overall average precision of 94.98%, indicating robust classifier confidence in accurately identifying stress cases. Participant 2 attained outstanding precision (95.45%) with KNN, whereas Participants 3 and 1 also demonstrated elevated precision values across various classifiers. Participant 4 exhibited reduced accuracy with KNN (59.77%), indicating that KNN’s performance is more influenced by individual variability.

Regarding recall, the values were comparatively lower than those for precision, with KNN attaining an average recall of 76.37%, LDA achieving 76.73%, and SVM reaching 71.64%. This indicates that although the models were proficient in identifying stress, certain instances were erroneously classified as non-stress cases, resulting in false negatives. The average F1-score was highest for KNN (83.71%), highlighting its dependability in cross-scenario stress detection. The maximum accuracy achieved with LDA was 96.36% for Participant 2, whereas the minimum was 47.70% for Participant 4, demonstrating the classifier’s susceptibility to individual differences. The average F1-score for LDA was 82.47%. SVM exhibited a marginally inferior performance, with its optimal result recorded for Participant 2 at 92.73% and the lowest accuracy of 57.83% noted for Participant 3. The average F1-score for SVM was 79.15%; however, its performance was surpassed by KNN in terms of overall robustness.

Table 14 displays the CIs and ANOVA test statistics for the performance metrics reported for the TP9 electrode in Scenario 2. The KNN classifier obtained the highest mean accuracy (79.89%), with a confidence interval of 67.28–92.50%, although the width of this interval indicates fairly strong variability between participants. The ANOVA test showed statistically significant differences for accuracy (p = 0.042), recall (p = 0.038), and F1-score (p = 0.049), which indicates that the mean values of these metrics differ among the classifiers. KNN yielded higher averages with tighter CIs than LDA and SVM for accuracy and the F1-score; therefore, KNN is particularly effective for detecting stress using the TP9 electrode. The variability for KNN nevertheless indicates some non-uniformity across participants that should be noted. Additionally, accuracy is not the only key performance metric: the differences in precision were not statistically significant (p = 0.124), suggesting that all classifiers perform fairly similarly in identifying positive classes.
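To make the two statistics in this kind of table concrete, the following Python sketch computes a 95% confidence interval for a mean and a one-way ANOVA F statistic with classifiers as groups and participants as samples. It is only an illustration under stated assumptions: the function names are ours, the interval is assumed to be a Student's t interval (the exact CI procedure is not stated in this section), and the accuracy lists are placeholder values rather than the study's data.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t for 4 degrees of freedom
# (five participants per classifier). Assumption: a t-based interval.
T_CRIT_DF4 = 2.776


def mean_ci95(values, t_crit=T_CRIT_DF4):
    """Return (mean, lower, upper) of a t-based 95% confidence interval."""
    m = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Placeholder per-participant accuracies (%), NOT the values from the tables.
accuracy = {
    "LDA": [70.0, 80.0, 90.0, 75.0, 85.0],
    "SVM": [72.0, 78.0, 88.0, 74.0, 83.0],
    "KNN": [65.0, 95.0, 85.0, 60.0, 90.0],
}

for name, values in accuracy.items():
    m, lo, hi = mean_ci95(values)
    print(f"{name}: mean={m:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
print("F =", one_way_anova_f(list(accuracy.values())))
```

The p-value follows from the F distribution with (k − 1, n − k) degrees of freedom; in practice it can be obtained with a statistics library, for example `scipy.stats.f_oneway`, which returns both the statistic and the p-value. A significant result indicates that at least one classifier's mean differs from the others; it says nothing about the shape of the underlying distribution.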

Table 15 provides a thorough performance comparison of stress classification using the TP10 electrode in Scenario 2. LDA exhibited the highest average accuracy among the classifiers (84.05%), marginally ahead of SVM (83.98%) and KNN (83.92%). Participant 2 had the highest recorded classification performance in this scenario, achieving 99.55% individual accuracy using KNN. Participant 2 also performed very well with LDA and SVM, attaining accuracies of 97.73% and 98.18%, respectively. Participant 1 had the lowest accuracy values (75.53% for LDA, 69.68% for SVM, and 62.77% for KNN), which shows that individual variability influences the performance of EEG-based stress detection. All classifiers showed consistently high precision values, demonstrating a high level of classifier confidence in stress state prediction. With an average precision of 96.16%, KNN outperformed SVM (93.89%) and LDA (91.29%). The model’s stability was further supported by the fact that Participant 2 had the highest individual precision (100%) using the KNN classifier.

Regarding recall, KNN obtained an average of 79.94%, SVM recorded 81.05%, and LDA displayed 81.72%. Both the highest and the lowest individual recall scores occurred with KNN: 99.10% for Participant 2 and 57.69% for Participant 1. This suggests that although classifiers had a high degree of confidence in their predictions (high precision), different participants had different levels of success in correctly identifying stress instances. KNN received the greatest mean F1-score (86.59%), followed by SVM (86.11%) and LDA (85.60%). Again, using KNN, Participant 2 had the best F1-score (99.55%), indicating the classifier’s resilience in detecting stress patterns in this individual.
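As a quick check on how these metrics relate, the F1-score is the harmonic mean of precision and recall. The short Python sketch below (the function name `f1_from` is ours, chosen for illustration) applies this to the TP10 values reported above for Participant 2 with KNN, a precision of 100% and a recall of 99.10%, and recovers the reported F1-score of 99.55% after rounding.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Participant 2, KNN, TP10, Scenario 2 (values quoted in the text).
f1_p2 = f1_from(100.0, 99.10)
print(round(f1_p2, 2))  # 99.55
```

Note that the mean F1-score across participants generally differs from the F1-score computed from the mean precision and the mean recall, because the harmonic mean is not linear. This is why the averaged F1 values in the tables cannot be reproduced exactly from the averaged precision and recall columns.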

Table 16 shows the CIs and ANOVA test statistics for the four performance metrics reported for the TP10 electrode in Scenario 2. The average accuracy for the LDA classifier was 84.05%, with a CI of 74.38–93.72%, indicating that the classifier performed uniformly across participants. The ANOVA test indicated statistically significant differences between the classifiers in accuracy (p = 0.018), precision (p = 0.022), recall (p = 0.041), and F1-score (p = 0.029), demonstrating that the mean values of these metrics are not the same for all classifiers. KNN provided an average precision of 96.16%, with a tight confidence interval of 93.18–99.14%, indicating that KNN shows more stable and robust performance when identifying stress markers. Thus, the three classifiers reported similar accuracy values, with the main differences appearing in precision, where KNN was ahead of the others.

To deliver a more thorough analysis of our findings, we computed the average accuracy across all classifiers, integrating the outcomes of each participant for every electrode. The overall findings are depicted in Figure 8, illustrating a clear trend: as noted in previous analyses, the optimal classifier differs for each electrode, with LDA attaining the highest accuracy on TP10 at 84.05%. SVM and KNN likewise achieved their highest accuracies (83.98% and 83.92%, respectively) using TP10. Furthermore, TP10 demonstrated the highest overall average accuracy across all classifiers at 83.98%. These results identify TP10 as the most informative single electrode in Scenario 2 and motivate its use as the starting point for the subsequent analyses.
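The per-electrode summary described above amounts to a simple aggregation. As a minimal Python illustration, using only the three TP10 classifier averages quoted in the text (the variable names are ours), the best classifier and the electrode-level average can be derived as follows.

```python
from statistics import mean

# Scenario 2, TP10: average accuracy (%) per classifier, as quoted in the text.
tp10_accuracy = {"LDA": 84.05, "SVM": 83.98, "KNN": 83.92}

best_classifier = max(tp10_accuracy, key=tp10_accuracy.get)
electrode_average = mean(list(tp10_accuracy.values()))

print(best_classifier, round(electrode_average, 2))  # LDA 83.98
```

The electrode-level average coincides with the SVM figure here only because the three classifier means are so close together.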

Similar to Figure 7, in Figure 9, we compared the maximum metric values across all models and electrodes for each participant in Scenario 2. We analyzed Table 5, Table 6, Table 7, and Table 8 and recorded, for each participant, the model and the electrode that attained the maximum value for each of the four adopted performance metrics. Participants 1 and 2 consistently showed a preference for AF8, while Participants 4 and 5 showed a preference for TP10 in three out of the four adopted metrics. It is also apparent that, on an individual level, one model may be more indicative than the others, e.g., KNN for Participant 3 and LDA for Participant 1. We believe that, following this analysis, a stress detection system can be tuned or customized to perform better on an individual level.

4.4. Electrode Fusion Performance Analysis Across Diverse Stress Scenarios

4.4.1. Electrode Fusion Analysis of Scenario 1

To evaluate whether combining multiple electrodes could improve the classification performance for Scenario 1, we progressively combined electrodes in descending order of their individual accuracy, commencing with TP10, which exhibited the highest accuracy. First, TP10 was integrated with AF8 to capitalize on the advantages of both electrodes, assessing performance based on accuracy. Subsequently, TP9 was included in the ensemble (TP10, AF8, and TP9) to ascertain whether integrating an electrode with a lower individual performance could enhance classifier efficacy. We further investigated the combination of TP10, AF8, and AF7 to assess the effect of incorporating AF7. Finally, all four sensors (TP10, AF8, TP9, and AF7) were combined to determine their collective impact on overall classification accuracy. Figure 10 summarizes the consolidated average results for all electrode arrangements, indicating a clear trend: each sensor configuration attained its peak accuracy with a different classifier. Notably, SVM in conjunction with TP10 attained the highest individual accuracy of 95.76%, whereas TP10 exhibited the highest overall average accuracy among all classifiers at 91.42%. These findings emphasize TP10’s essential role in the classification framework, as illustrated in Figure 8, where its significant impact on stress detection accuracy is apparent.
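The combination schedule described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the study's code: the helper names `fusion_schedule` and `fuse_features` are ours, and combining electrodes by concatenating their per-epoch feature vectors is an assumption, since the fusion mechanism is not detailed in this section.

```python
def fusion_schedule(ranked):
    """Build the four electrode combinations evaluated in the text.

    `ranked` lists the four electrodes in descending order of their
    individual accuracy, e.g. ["TP10", "AF8", "TP9", "AF7"].
    """
    best, second, third, fourth = ranked
    return [
        (best, second),
        (best, second, third),
        (best, second, fourth),
        (best, second, third, fourth),
    ]


def fuse_features(features_by_electrode, combo):
    """Concatenate the per-epoch feature vectors of the chosen electrodes.

    Assumption: feature-level fusion. `features_by_electrode[name][i]` is
    the feature vector of epoch i for electrode `name`.
    """
    n_epochs = len(features_by_electrode[combo[0]])
    return [
        [x for name in combo for x in features_by_electrode[name][i]]
        for i in range(n_epochs)
    ]


for combo in fusion_schedule(["TP10", "AF8", "TP9", "AF7"]):
    print(" + ".join(combo))
```

Each fused feature matrix would then be passed to the same three classifiers, so that any change in accuracy can be attributed to the electrode set rather than to the model.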

4.4.2. Electrode Fusion Analysis of Scenario 2

In this scenario, we likewise sought the best accuracy by methodically integrating electrodes in descending order of their accuracy and assessing every combination using three classifiers: LDA, SVM, and KNN. The procedure started with the TP10 and AF8 combination, for which we evaluated the classification performance among the participants, using accuracy to ascertain its effect. The electrode combinations were subsequently broadened to encompass TP10, AF8, and TP9, then TP10, AF8, and AF7, and ultimately all four electrodes (TP10, AF8, TP9, and AF7). Figure 11 displays the aggregated accuracy values for each electrode combination, demonstrating that LDA attained the highest accuracy of 90.27% with the TP10, AF8, and TP9 combination. This electrode grouping also demonstrated the greatest overall average accuracy among all classifiers, achieving 88.05%, confirming its efficacy in enhancing stress classification performance.

This analysis reveals that the model and sensor configuration with the highest accuracy varied between the two scenarios, but both scenarios achieved a strong classification performance across different stress-inducing contexts. In Scenario 1, with MAE data as the training set and VR data as the test set, the SVM model attained the highest accuracy at 95.76%, while the TP10 sensor had the highest individual sensor accuracy at 91.42%, as shown in Figure 10. In Scenario 2, using VR data for training and MAE data for testing, the optimal configuration was achieved by combining the TP10, AF8, and TP9 sensors, reaching an overall accuracy of 88.05% for all classifiers. The highest performance using electrode fusion was achieved using LDA trained with the TP10, AF8, and TP9 sensors, yielding an accuracy of 90.27%, as shown in Figure 11. These results underline the significance of selecting optimal sensor combinations and adapting the model to various stress-inducing conditions to achieve the highest accuracy. Notably, TP10 consistently performed as the most reliable sensor across both scenarios, supporting the potential for developing robust, adaptable stress detection systems that can function effectively across different environments and contexts.
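The two cross-context scenarios compared above share one evaluation pattern: fit on one context, predict on the other, then swap. The sketch below illustrates that pattern with scikit-learn. It is not the study's pipeline: the data are synthetic stand-ins generated by a hypothetical `make_context` helper, the classifiers use library default hyperparameters, and the feature standardization step is our assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_context(rng, n, shift):
    """Synthetic stand-in for one context: (features, labels), label 1 = stress."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 6)) + 1.5 * y[:, None] + shift
    return X, y


def cross_context_scores(X_train, y_train, X_test, y_test):
    """Train each classifier on one context and score it on the other."""
    models = {
        "LDA": LinearDiscriminantAnalysis(),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    }
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        }
    return scores


rng = np.random.default_rng(0)
X_mae, y_mae = make_context(rng, 200, shift=0.0)  # placeholder for MAE features
X_vr, y_vr = make_context(rng, 200, shift=0.3)    # placeholder for VR features

scenario_1 = cross_context_scores(X_mae, y_mae, X_vr, y_vr)  # train MAE, test VR
scenario_2 = cross_context_scores(X_vr, y_vr, X_mae, y_mae)  # train VR, test MAE
print(scenario_1["SVM"], scenario_2["LDA"])
```

Keeping the test context entirely out of training, including out of any scaler fitting as done by the pipelines here, is what makes the reported accuracy a measure of cross-context generalization.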

4.5. Comparative Analysis of Wavelet Families for EEG-Based Stress Detection

This section evaluates the efficacy of diverse wavelet denoising functions in improving stress detection reliability across various contexts. The analysis examines three notable wavelet families—Symlets, Daubechies, and Coiflets—to determine the most appropriate preprocessing method for EEG signal processing. To conduct a systematic assessment, we initially employed wavelet denoising on Scenario 1, utilizing MAE data for training and VR data for testing, specifically concentrating on the TP10 sensor due to its established reliability in previous analyses. Afterwards, we executed a similar evaluation in Scenario 2, altering the data arrangement by employing VR data for training and MAE data for testing, once more prioritizing TP10 due to its reliable performance. This methodical approach guarantees a thorough comparison of wavelet functions in enhancing EEG-based stress detection.
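For readers unfamiliar with wavelet denoising, the following Python sketch shows one common form of the technique using the PyWavelets library, with the mother wavelet exposed as a parameter so that different families can be compared. The decomposition level, the soft universal threshold, and the synthetic test signal are all our assumptions for illustration; this section does not specify the thresholding rule used in the study.

```python
import numpy as np
import pywt


def wavelet_denoise(signal, wavelet="sym4", level=4):
    """Soft-threshold wavelet denoising of a 1-D signal (illustrative sketch)."""
    signal = np.asarray(signal, dtype=float)
    # Multilevel decomposition: [approximation, coarsest detail, ..., finest detail].
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise level estimated from the finest detail band (median absolute deviation).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold; assumption, other rules are possible.
    threshold = sigma * np.sqrt(2.0 * np.log(signal.size))
    # Keep the approximation untouched and shrink every detail band.
    shrunk = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    # Reconstruction can be one sample longer than the input, so trim it.
    return pywt.waverec(shrunk, wavelet)[: signal.size]


# Synthetic example: a 5 Hz sine sampled at 256 Hz with additive Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(1024) / 256.0
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.5 * rng.standard_normal(t.size)

for name in ("sym4", "db2", "coif3"):
    denoised = wavelet_denoise(noisy, wavelet=name)
    print(name, float(np.mean((denoised - clean) ** 2)))
```

Swapping the `wavelet` argument is the only change needed to compare families. In the study itself, the comparison criterion is the downstream classification accuracy rather than a reconstruction error such as the one printed here.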

4.5.1. Symlets Wavelet Family

We examined the efficacy of different Symlets wavelet functions (sym1, sym2, sym3, and sym4) to assess their influence on EEG-based stress detection. Figure 12 provides a comparative analysis of these wavelets in Scenario 1 and Scenario 2, elucidating their relative performance. In Scenario 1, employing the MAE data for training and VR data for testing, the sym4 wavelet consistently surpassed the other Symlets functions, attaining the highest average accuracy of 91.42%, as illustrated in Figure 12a. This enhanced performance indicates that sym4 effectively maintains pertinent EEG signal characteristics, enabling classifiers to generalize stress patterns from MAE to VR environments with greater reliability.

Among the classifiers, SVM demonstrated superior performance when paired with sym4, achieving an accuracy of 95.76%. LDA and KNN exhibited competitive outcomes; however, their accuracy metrics were marginally inferior. The consistent superiority of sym4 across the classifiers demonstrates its efficacy in denoising EEG signals and in improving the reliability of stress classification when moving from the MAE context to the VR context.

In Scenario 2, where VR data were applied for training and MAE data were applied for testing, the performance of various wavelet functions exhibited greater variability. In contrast to Scenario 1, where sym4 was the top performer, sym2 proved to be the most effective wavelet in Scenario 2, attaining the highest average accuracy, as depicted in Figure 12b. This indicates that sym2 more effectively preserved essential stress-related EEG characteristics in this scenario, which entails varying cognitive requirements. The LDA classifier, in conjunction with sym2, attained the highest accuracy of 87.27%, surpassing both SVM and KNN in this context. This outcome demonstrates that sym2 augments stress feature extraction, thereby complementing LDA’s classification methodology, which renders it especially efficacious for cross-scenario stress detection.

4.5.2. Daubechies Wavelet Family

The findings in Figure 13 offer a comparison of the Daubechies wavelet functions (db1, db2, db3, and db4) in Scenarios 1 and 2. This assessment aimed to identify the most efficient Daubechies wavelet for preprocessing EEG signals in stress detection and to compare its efficacy with previously examined wavelet families. In Scenario 1, the db4 wavelet attained the greatest average accuracy among the Daubechies wavelets, achieving 83.39%. This signifies that db4 was the most efficient in maintaining stress-related EEG characteristics, thereby improving classifier efficacy. Nonetheless, db4 did not exceed the performance of sym4 (91.42%), as illustrated in Figure 13a, thereby affirming sym4’s reliability as the superior wavelet selection for this context. In the analysis of classifier-specific performance, SVM attained the highest accuracy of 87.15% with db4, surpassing both LDA and KNN. LDA achieved an accuracy of 80.44%, whereas KNN attained 82.58% using db4. The results indicate that SVM, in conjunction with db4, exhibited the highest classification accuracy.

In Scenario 2, db2 was identified as the most effective Daubechies wavelet, attaining the highest average accuracy of 85.60% compared to the other wavelets, as depicted in Figure 13b. This indicates that db2 was a more effective EEG denoising method when trained on VR stressors and evaluated on structured cognitive stressors (MAE). Among the classifiers, LDA exhibited superior performance with db2, achieving an accuracy of 87.27%, thereby illustrating its resilience in adapting to cross-scenario stress detection. SVM attained an accuracy of 84.85%, whereas KNN reached 84.69% using db2. The findings suggest that db2 is a feasible alternative to sym2, the best-performing Symlets wavelet in Scenario 2, whose average accuracy of 85.57% is practically identical.

4.5.3. Coiflets Wavelet Family

Lastly, Coiflets3 (COIF3) was evaluated to examine its efficiency in enhancing EEG analysis and cross-context stress detection. To consolidate our findings, we compared COIF3 with the highest-performing wavelet from each of the other families, selected according to the previously calculated average accuracy. Figure 14 presents the findings of this comparison. For Scenario 1, as shown in Figure 14a, sym4 from the Symlets family achieved the highest average accuracy of 91.42%, outperforming db4 from the Daubechies family and COIF3 from the Coiflets family. In the classifier-based comparison, the SVM classifier reached the highest accuracy of 95.76% using sym4, which was greater than the accuracies obtained with db4 (87.15%) and COIF3 (87.72%). These results indicate that sym4 is the most robust wavelet denoising function for this scenario.

On the other hand, for Scenario 2, Figure 14b shows that COIF3 achieved an average accuracy of 85.08%, which is lower than those attained by db2 (85.60%) and sym2 (85.57%). For classifier-based comparison, the LDA classifier achieved the maximum accuracy of 87.27% using both db2 and sym2 wavelet functions, which was higher than the 86.05% reached by employing COIF3 for the same classifier.

5. Conclusions

This study examined the adaptability of ML-based stress detection methods across various stress-inducing contexts by tackling four principal research questions: (1) How well do ML models generalize when trained on data from one stress-inducing scenario and tested on another? (2) What are the implications for the volume of training data required? (3) Does the highest-performing ML model remain consistent across scenarios? (4) Does the optimal wavelet denoising function vary with different stressors? To address these questions, we established a systematic EEG data acquisition protocol using the MUSE-S™ headband, gathering EEG signals from five undergraduate engineering students under two stress conditions: MAE and VR gaming. Three classical ML models (LDA, SVM, and KNN) were assessed across various electrode configurations and wavelet denoising functions to ascertain their efficacy in stress detection across different scenarios.

Our results demonstrate that ML models can generalize across stress scenarios; however, performance depends on electrode selection, classifier type, and the mother wavelet used for denoising. In Scenario 1, employing MAE data for training and VR data for testing, the SVM classifier attained the highest accuracy (95.76%) with the TP10 sensor, which also exhibited the highest average accuracy (91.42%) among all classifiers and participants. The sym4 wavelet function was determined to be the most effective for this scenario, achieving an average accuracy of 91.42%. In Scenario 2, using VR data for training and MAE data for testing, the greatest mean accuracy of 88.05% was achieved by integrating the TP10, AF8, and TP9 electrodes, averaging the results across all classifiers. Unlike Scenario 1, the optimal wavelet function in this scenario was db2, attaining a mean accuracy of 85.60%. LDA appeared to be the most effective classifier in this context, attaining a maximum accuracy of 90.27% when averaging results across all participants. The findings demonstrate that classifier performance and wavelet function efficacy vary with the stress context, highlighting the necessity for adaptive methodologies in EEG-based stress detection.

The present research, despite its contributions, has several limitations that require attention in future research. The primary limitation is the small sample size, with only five participants involved in the study. The restricted dataset limits the generalizability of the findings, necessitating a larger and more diverse sample to validate the results. This research was carried out in a controlled laboratory environment, which may not accurately reflect real-world stress conditions. Subsequent research should investigate authentic stress situations to improve ecological validity. A further limitation is the limited variety of stressors, as only two stress-inducing conditions (MAE and VR gaming) were investigated. Exploring a wider array of cognitive, emotional, and physical stressors would yield a more thorough comprehension of ML-based stress detection. This research used the MUSE-S™ headband, which has only four electrodes (TP10, AF8, TP9, and AF7). Future investigations may utilize higher-density EEG systems to examine other brain regions and enhance detection precision. Finally, the study concentrated on traditional ML models (LDA, SVM, and KNN), whereas subsequent research could explore deep learning methodologies (convolutional neural networks, transformers, and hybrid architectures) for automatic feature extraction and enhanced classification efficacy.

Subsequent research may focus on broadening the participant pool to improve model robustness and guarantee generalizability across varied populations. Furthermore, investigating extra stress-inducing factors beyond MAE and VR gaming would enhance the development of a broader stress detection framework. The integration of multimodal stress detection by incorporating EEG with physiological signals, including heart rate variability (HRV), skin conductance (EDA), and electrocardiogram (ECG) signals, may substantially enhance classification accuracy. Improving model adaptability via transfer learning and domain adaptation techniques would enable stress detection models to generalize more efficiently across diverse environments. Subsequent research should investigate the practical application of ML-based stress detection systems in workplace, educational, and healthcare environments to evaluate feasibility and implementation. Overcoming existing limitations and promoting future research avenues will be essential for creating scalable, resilient, and practical stress detection solutions.

Author Contributions

Conceptualization, A.A.-K. and O.A.; methodology, M.M., A.A.-K., and O.A.; software, M.M.; validation, A.A.-K. and O.A.; formal analysis, M.M.; investigation, M.M., A.A.-K., and O.A.; resources, A.A.-K. and O.A.; data curation, M.M.; writing – original draft preparation, M.M., O.A., and A.A.-K.; writing – review and editing, A.A.-K. and O.A.; visualization, M.M., A.A.-K., and O.A.; supervision, A.A.-K. and O.A.; project administration, A.A.-K. and O.A.; funding acquisition, A.A.-K. and O.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Information Technology Industry Development Agency (ITIDA), Egypt, under grant number ‘CFP202’.

Institutional Review Board Statement

The study was conducted and approved by the Ethics Committee of the Department of Electronics and Communication Engineering at the Arab Academy for Science, Technology, and Maritime Transport (reference number 67/23; October 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to data privacy reasons. Some of the data can be found on GitHub at the following link: https://github.com/alkabbany/stress_det_in_the_wild (accessed on 18 February 2025).

Acknowledgments

The authors extend their heartfelt gratitude to Rojaina Mahmoud for her invaluable effort and dedication in data acquisition. Her commitment was truly inspiring and greatly appreciated, making this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Health and Safety Executive. Work-Related Ill Health and Occupational Disease in Great Britain. Available online: https://www.hse.gov.uk/statistics/causdis/index.htm (accessed on 18 February 2025).
Iqbal, T.; Simpkin, A.J.; Roshan, D.; Glynn, N.; Killilea, J.; Walsh, J.; Molloy, G.; Ganly, S.; Ryman, H.; Coen, E. Stress Monitoring Using Wearable Sensors: A Pilot Study and Stress-Predict Dataset. Sensors 2022, 22, 8135.
Nekoei, A.; Sigurdsson, J.; Wehr, D. The Economic Burden of Burnout. 2024. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4856947 (accessed on 13 February 2025).
Koivusalo, S. Occupational Stress and Its Effects on Knowledge Workers’ Job Performance. Bachelor’s Thesis, Jyväskylä University of Applied Sciences, Jyväskylä, Finland, 2020.
Timotius, E.; Octavius, G.S. Stress at the Workplace and Its Impacts on Productivity: A Systematic Review from Industrial Engineering, Management, and Medical Perspective. Ind. Eng. Manag. Syst. 2022, 21, 192–205.
Musazzi, L.; Mingardi, J.; Ieraci, A.; Barbon, A.; Popoli, M. Stress, microRNAs, and Stress-Related Psychiatric Disorders: An Overview. Mol. Psychiatry 2023, 28, 4977–4994.
Mazure, C.M.; Husky, M.M.; Pietrzak, R.H. Stress as a Risk Factor for Mental Disorders in a Gendered Environment. JAMA Psychiatry 2023, 80, 1087–1088.
Panicker, S.S.; Gayathri, P. A Survey of Machine Learning Techniques in Physiology Based Mental Stress Detection Systems. Biocybern. Biomed. Eng. 2019, 39, 444–469.
Petrantonakis, P.C.; Hadjileontiadis, L.J. A Novel Emotion Elicitation Index Using Frontal Brain Asymmetry for Enhanced EEG-Based Emotion Recognition. IEEE Trans. Inf. Technol. Biomed. 2011, 15, 737–746.
Attallah, O. ADHD-AID: Aiding Tool for Detecting Children’s Attention Deficit Hyperactivity Disorder via EEG-Based Multi-Resolution Analysis and Feature Selection. Biomimetics 2024, 9, 188.
Savarimuthu, S.R.; Karuppannan, S.J.K. An Investigation on Mental Stress Detection from Various Physiological Signals. In Proceedings of the AIP Conference Proceedings, Salem, India, 25–26 February 2022; AIP Publishing: Melville, NY, USA, 2023; Volume 2857.
Attallah, O. An Effective Mental Stress State Detection and Evaluation System Using Minimum Number of Frontal Brain Electrodes. Diagnostics 2020, 10, 292–327.
Pourmohammadi, S.; Maleki, A. Stress Detection Using ECG and EMG Signals: A Comprehensive Study. Comput. Methods Programs Biomed. 2020, 193, 105482.
Giannakakis, G.; Grigoriadis, D.; Giannakaki, K.; Simantiraki, O.; Roniotis, A.; Tsiknakis, M. Review on Psychological Stress Detection Using Biosignals. IEEE Trans. Affect. Comput. 2019, 13, 440–460.
Vos, G.; Trinh, K.; Sarnyai, Z.; Azghadi, M.R. Generalizable Machine Learning for Stress Monitoring from Wearable Devices: A Systematic Literature Review. Int. J. Med. Inf. 2023, 173, 105026.
Mamdouh, M.; Mahmoud, R.; Attallah, O.; Al-Kabbany, A. Stress Detection in the Wild: On the Impact of Cross-Training on Mental State Detection. In Proceedings of the 2023 40th National Radio Science Conference (NRSC), Giza, Egypt, 30 May–1 June 2023; IEEE: Piscataway, NJ, USA, 2023; Volume 1, pp. 150–158.
Healey, J.A.; Picard, R.W. Detecting Stress during Real-World Driving Tasks Using Physiological Sensors. IEEE Trans. Intell. Transp. Syst. 2005, 6, 156–166.
Saeed, S.M.U.; Anwar, S.M.; Khalid, H.; Majid, M.; Bagci, U. EEG Based Classification of Long-Term Stress Using Psychological Labeling. Sensors 2020, 20, 1886.
Asif, A.; Majid, M.; Anwar, S.M. Human Stress Classification Using EEG Signals in Response to Music Tracks. Comput. Biol. Med. 2019, 107, 182–196.
Rajendran, V.G.; Jayalalitha, S.; Adalarasu, K. EEG Based Evaluation of Examination Stress and Test Anxiety among College Students. IRBM 2022, 43, 349–361.
Halim, Z.; Rehan, M. On Identification of Driving-Induced Stress Using Electroencephalogram Signals: A Framework Based on Wearable Safety-Critical Scheme and Machine Learning. Inf. Fusion 2020, 53, 66–79.
Jebelli, H.; Hwang, S.; Lee, S. EEG-Based Workers’ Stress Recognition at Construction Sites. Autom. Constr. 2018, 93, 315–324.
Kosch, T.; Hassib, M.; Buschek, D.; Schmidt, A. Look into My Eyes: Using Pupil Dilation to Estimate Mental Workload for Task Complexity Adaptation. In Proceedings of the Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: Montreal, QC, Canada, 2018; pp. 1–6.
Lim, W.L.; Liu, Y.; Subramaniam, S.C.H.; Liew, S.H.P.; Krishnan, G.; Sourina, O.; Konovessis, D.; Ang, H.E.; Wang, L. EEG-Based Mental Workload and Stress Monitoring of Crew Members in Maritime Virtual Simulator. In Transactions on Computational Science XXXII; Springer: Berlin/Heidelberg, Germany, 2018; pp. 15–28.
Gaurav, A.R.; Kumar, V. EEG-Metric Based Mental Stress Detection. Netw. Biol. 2018, 8, 25–34.
Lahane, P.; Thirugnanam, M. Human Emotion Detection and Stress Analysis Using EEG Signal. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 96–100.
Gupta, R.; Alam, M.A.; Agarwal, P. Modified Support Vector Machine for Detecting Stress Level Using EEG Signals. Comput. Intell. Neurosci. 2020, 2020, 8860841.
Mane, S.A.M.; Shinde, A.A. Novel Imaging Approach for Mental Stress Detection Using EEG Signals. In Proceedings of Academia-Industry Consortium for Data Science, Advances in Intelligent Systems and Computing; Gupta, G., Wang, L., Yadav, A., Rana, P., Wang, Z., Eds.; Springer Nature: Singapore, 2022; Volume 1411, pp. 25–36. ISBN 9789811668869.
Peterson, S.M.; Furuichi, E.; Ferris, D.P. Effects of Virtual Reality High Heights Exposure during Beam-Walking on Physiological Stress and Cognitive Loading. PLoS ONE 2018, 13, e0200306.
Mateos-García, N.; Gil-González, A.-B.; Luis-Reboredo, A.; Pérez-Lancho, B. Driver Stress Detection from Physiological Signals by Virtual Reality Simulator. Electronics 2023, 12, 2179.
El-Habrouk, J. The Use of EEG and VR in Mental State Recognition. Ph.D. Thesis, Carleton University, Ottawa, ON, Canada, 2023.
Ahmad, Z.; Rabbani, S.; Zafar, M.R.; Ishaque, S.; Krishnan, S.; Khan, N. Multi-Level Stress Assessment from ECG in a Virtual Reality Environment Using Multimodal Fusion. IEEE Sens. J. 2023, 23, 29559–29570.
Kim, H.; Song, S.; Cho, B.H.; Jang, D.P. Deep Learning-Based Stress Detection for Daily Life Use Using Single-Channel EEG and GSR in a Virtual Reality Interview Paradigm. PLoS ONE 2024, 19, e0305864.
Gomes, P.; Kaiseler, M.; Lopes, B.; Faria, S.; Queiros, C.; Coimbra, M. Are Standard Heart Rate Variability Measures Associated with the Self-Perception of Stress of Firefighters in Action? In Proceedings of the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 3–7 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 2571–2574.
Lee, B.-G.; Chung, W.-Y. Wearable Glove-Type Driver Stress Detection Using a Motion Sensor. IEEE Trans. Intell. Transp. Syst. 2016, 18, 1835–1844.
Cinaz, B.; Arnrich, B.; La Marca, R.; Tröster, G. Monitoring of Mental Workload Levels during an Everyday Life Office-Work Scenario. Pers. Ubiquitous Comput. 2013, 17, 229–239.
Vesisenaho, M.; Juntunen, M.; Häkkinen, P.; Pöysä-Tarhonen, J.; Fagerlund, J.; Miakush, I.; Parviainen, T. Virtual Reality in Education: Focus on the Role of Emotions and Physiological Reactivity. J. Virtual Worlds Res. 2019, 12, 1–15.
Gevins, A.; Smith, M.E. Neurophysiological Measures of Cognitive Workload during Human-Computer Interaction. Theor. Issues Ergon. Sci. 2003, 4, 113–131.
Fairclough, S.H.; Mulder, L.J.M. Psychophysiological processes of mental effort investment. In How Motivation Affects Cardiovascular Response: Mechanisms and Applications; Wright, R.A., Gendolla, G.H.E., Eds.; American Psychological Association: Washington, DC, USA, 2012; pp. 61–76.
Bird, J.J.; Faria, D.R.; Manso, L.J.; Ekárt, A.; Buckingham, C.D. A Deep Evolutionary Approach to Bioinspired Classifier Optimisation for Brain-Machine Interaction. Complexity 2019, 2019, 4316548.
Al-Shargie, F.; Tang, T.B.; Badruddin, N.; Kiguchi, M. Towards Multilevel Mental Stress Assessment Using SVM with ECOC: An EEG Approach. Med. Biol. Eng. Comput. 2018, 56, 125–136.
Singh, P.; Singla, R.; Kesari, A. An EEG Based Approach for the Detection of Mental Stress Level: An Application of BCI. In Recent Innovations in Mechanical Engineering; Vashista, M., Manik, G., Verma, O.P., Bhardwaj, B., Eds.; Lecture Notes in Mechanical Engineering; Springer: Singapore, 2022; pp. 49–57. ISBN 9789811692352.
Mane, S.A.M.; Shinde, A. StressNet: Hybrid Model of LSTM and CNN for Stress Detection from Electroencephalogram Signal (EEG). Results Control Optim. 2023, 11, 100231.
Selesnick, I.W.; Baraniuk, R.G.; Kingsbury, N.C. The Dual-Tree Complex Wavelet Transform. IEEE Signal Process. Mag. 2005, 22, 123–151.
Subasi, A. EEG Signal Classification Using Wavelet Feature Extraction and a Mixture of Expert Model. Expert Syst. Appl. 2007, 32, 1084–1093.
Al-Qazzaz, N.K.; Hamid Bin Mohd Ali, S.; Ahmad, S.A.; Islam, M.S.; Escudero, J. Selection of Mother Wavelet Functions for Multi-Channel EEG Signal Analysis during a Working Memory Task. Sensors 2015, 15, 29015–29035.
Patil, S.S.; Pawar, M.K. Quality Advancement of EEG by Wavelet Denoising for Biomedical Analysis. In Proceedings of the 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Mumbai, India, 19–20 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1–6.
Saeed, S.M.U.; Anwar, S.M.; Khalid, H.; Majid, M.; Bagci, U. Electroencephalography Based Classification of Long-Term Stress Using Psychological Labeling. arXiv 2019, arXiv:1907.07671. [Google Scholar]
Jenke, R.; Peer, A.; Buss, M. Feature Extraction and Selection for Emotion Recognition from EEG. IEEE Trans. Affect. Comput. 2014, 5, 327–339. [Google Scholar] [CrossRef]
Zamanian, H.; Farsi, H. A New Feature Extraction Method to Improve Emotion Detection Using EEG Signals. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2018, 17, 29–44. [Google Scholar] [CrossRef]

Figure 1. Visual schematic of the proposed framework for cross-context stress detection.

Figure 2. Electrode positions used in the MUSE S™ headband [41].

Figure 3. Overview of experimental protocol stages and durations.

Figure 4. Schematic of the EEG signal preprocessing and processing pipeline showcasing the sequence of techniques used for signal processing and feature derivation.

Figure 5. Experimental setup for EEG data collection and analysis for one participant.

Figure 6. Scenario 1: Average accuracy of each sensor across the three classifiers.

Figure 7. Scenario 1: Maximum metric values across all models and electrodes per participant.

Figure 8. Scenario 2: Average accuracy of each sensor.

Figure 9. Scenario 2: Maximum metrics across all models and electrodes per participant.

Figure 10. Scenario 1: Average accuracy of all combinations.

Figure 11. Scenario 2: Average accuracy of all combinations.

Figure 12. A comparative analysis of various Symlets wavelet functions in (a) Scenario 1 and (b) Scenario 2.

Figure 13. A comparative analysis of various Daubechies wavelet functions in (a) Scenario 1 and (b) Scenario 2.

Figure 14. A comparison of the Coiflets wavelet and other wavelet functions that achieved the highest performance in (a) Scenario 1 and (b) Scenario 2.

Table 1. Scenario 1: Performance measures (%) of AF7 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	68.24	54.73	47.86	76.03	99.29	69.23
Precision	90.54	98.65	95.71	75.34	98.57	91.76
Recall	62.62	52.52	48.91	76.39	100	68.09
F1-Score	74.03	68.54	64.73	75.86	99.28	76.49
SVM
Accuracy	72.30	58.11	64.29	90.41	98.57	76.73
Precision	86.49	97.30	94.29	95.89	97.14	94.22
Recall	67.37	54.55	58.93	86.42	100	73.45
F1-Score	75.74	69.90	72.53	90.91	98.55	81.53
KNN
Accuracy	68.92	47.30	79.29	100	99.29	78.96
Precision	86.49	94.59	97.14	100	98.57	95.36
Recall	64.00	48.61	71.58	100	100	76.84
F1-Score	73.56	64.22	82.42	100	99.28	83.90
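
The F1-score rows in these tables can be cross-checked against the precision and recall rows. The sketch below assumes the standard harmonic-mean definition of F1 and uses the LDA values for the AF7 electrode from Table 1; the function name `f1_score` and the 0.05 tolerance are illustrative choices, not taken from the article. Because the tabulated precision and recall are already rounded to two decimals, the recomputed F1 can differ from the tabulated one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, tabulated F1 %) for LDA on AF7, Scenario 1 (Table 1)
rows = [
    (90.54, 62.62, 74.03),  # Participant 1
    (98.65, 52.52, 68.54),  # Participant 2
    (95.71, 48.91, 64.73),  # Participant 3
    (75.34, 76.39, 75.86),  # Participant 4
    (98.57, 100.0, 99.28),  # Participant 5
]

TOLERANCE = 0.05  # assumed allowance for rounding in the tabulated inputs

for participant, (p, r, tabulated) in enumerate(rows, start=1):
    computed = f1_score(p, r)
    status = "ok" if abs(computed - tabulated) < TOLERANCE else "mismatch"
    print(f"Participant {participant}: computed {computed:.2f}, tabulated {tabulated:.2f} ({status})")
```

The same check applies to any classifier block in the per-electrode tables by swapping in the corresponding precision, recall, and F1 rows.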

Table 2. Scenario 1: Performance measures (%) of AF7 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	69.23	18.94	50.12	88.34	0.021
	SVM	76.73	16.15	59.24	94.22
	KNN	78.96	21.32	55.38	102.54
Precision	LDA	91.76	8.93	82.54	100.98	0.104
	SVM	94.22	6.54	87.43	101.01
	KNN	95.36	5.87	89.23	101.49
Recall	LDA	68.09	19.45	48.21	87.97	0.018
	SVM	73.45	17.32	55.67	91.23
	KNN	76.84	21.02	54.56	99.12
F1-Score	LDA	76.49	10.56	65.61	87.37	0.034
	SVM	81.53	9.87	71.34	91.72
	KNN	83.90	12.45	70.98	96.82
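
For readers who want to derive summary statistics of this kind from the per-participant values, the following is a minimal sketch of one common approach: the sample mean, the sample standard deviation, and a two-sided 95% Student-t interval. The tables do not state how the standard deviation and interval bounds were derived, so this is an assumption, and its output should not be expected to reproduce the tabulated bounds. The helper name `mean_and_ci`, the optional clipping to the 0 to 100 range, and the hard-coded critical value are illustrative.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 participants).
T_CRIT_DF4 = 2.776


def mean_and_ci(values, t_crit=T_CRIT_DF4, clip=(0.0, 100.0)):
    """Return (mean, sample std, CI lower, CI upper) for a list of percentages.

    t_crit must match len(values) - 1 degrees of freedom; the default is only
    valid for five values. Set clip=None to leave the bounds unclipped.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample (n - 1) standard deviation
    half_width = t_crit * sd / math.sqrt(n)
    lower, upper = mean - half_width, mean + half_width
    if clip is not None:
        lower, upper = max(clip[0], lower), min(clip[1], upper)
    return mean, sd, lower, upper


# LDA accuracy (%) on AF7, Scenario 1, Participants 1 to 5 (Table 1)
lda_accuracy_af7 = [68.24, 54.73, 47.86, 76.03, 99.29]

mean, sd, lower, upper = mean_and_ci(lda_accuracy_af7)
print(f"mean={mean:.2f}  sd={sd:.2f}  95% CI=[{lower:.2f}, {upper:.2f}]")
```

Clipping keeps the bounds inside the valid range for a percentage; with few participants and high variance, an unclipped interval can extend past 100.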

Table 3. Scenario 1: Performance measures (%) of AF8 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	99.32	85.81	50.00	67.12	92.14	78.88
Precision	98.65	100	100	100	84.29	96.59
Recall	100	77.89	50.00	60.33	100	77.64
F1-Score	99.32	87.57	66.67	75.26	91.47	84.06
SVM
Accuracy	97.30	87.84	80.71	77.40	94.29	87.51
Precision	98.65	100	98.57	100	88.57	97.16
Recall	96.05	80.43	72.63	68.87	100	83.60
F1-Score	97.33	89.16	83.64	81.56	93.94	89.13
KNN
Accuracy	95.95	79.05	84.29	79.45	94.29	86.60
Precision	98.65	100	100	100	88.57	97.44
Recall	93.59	70.48	76.09	70.87	100	82.21
F1-Score	96.05	82.68	86.42	82.95	93.94	88.41

Table 4. Scenario 1: Performance measures (%) of AF8 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	78.88	20.68	64.50	93.26	0.001
	SVM	87.51	7.59	81.94	93.08
	KNN	86.60	9.25	80.39	92.81
Precision	LDA	96.59	7.34	91.65	101.53	0.003
	SVM	97.16	5.10	93.76	100.56
	KNN	97.44	6.00	93.51	101.37
Recall	LDA	77.64	20.71	63.22	92.06	0.002
	SVM	83.60	13.25	74.76	92.44
	KNN	82.21	14.26	73.02	91.40
F1-Score	LDA	84.06	14.46	74.62	93.50	0.001
	SVM	89.13	9.65	82.51	95.75
	KNN	88.41	11.35	80.87	95.95

Table 5. Scenario 1: Performance measures (%) of TP9 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	87.16	56.76	50.00	69.18	85.71	69.76
Precision	91.89	100	100	78.08	77.14	89.42
Recall	83.95	53.62	50.00	66.28	93.10	69.39
F1-Score	87.74	69.81	66.67	71.70	84.38	76.06
SVM
Accuracy	85.81	54.05	50.00	76.71	75.71	68.46
Precision	91.89	100	100	83.56	72.86	89.66
Recall	81.93	52.11	50.00	73.49	77.27	66.96
F1-Score	86.62	68.52	66.67	78.21	75.00	75.00
KNN
Accuracy	83.78	52.70	50.00	71.23	77.14	66.97
Precision	90.54	100	100	95.89	80.00	93.29
Recall	79.76	51.39	50.00	64.22	75.68	64.21
F1-Score	84.81	67.89	66.67	76.92	77.78	74.81

Table 6. Scenario 1: Performance measures (%) of TP9 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	69.76	14.48	50.23	89.29	0.081
	SVM	68.46	10.39	52.23	84.69
	KNN	66.97	13.69	47.31	86.63
Precision	LDA	89.42	8.27	77.21	101.63	0.235
	SVM	89.66	12.23	71.89	107.43
	KNN	93.29	6.54	82.98	103.60
Recall	LDA	69.39	15.21	48.45	90.33	0.037
	SVM	66.96	15.98	44.92	89.00
	KNN	64.21	12.87	45.01	83.41
F1-Score	LDA	76.06	8.45	62.45	89.67	0.112
	SVM	75.00	10.34	58.69	91.31
	KNN	74.81	8.67	61.21	88.41

Table 7. Scenario 1: Performance measures (%) of TP10 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	64.19	91.89	68.57	96.55	82.86	89.32
Precision	81.08	100	95.71	98.61	81.43	96.93
Recall	60.61	86.05	62.04	94.67	83.82	86.34
F1-Score	69.36	92.50	75.28	96.60	82.61	90.78
SVM
Accuracy	70.27	93.24	92.86	93.84	80.00	95.76
Precision	85.14	100	91.43	93.15	84.29	97.30
Recall	65.63	88.10	94.12	94.44	77.63	94.59
F1-Score	74.12	93.67	92.75	93.79	80.82	95.85
KNN
Accuracy	58.11	79.05	93.57	89.73	84.29	89.17
Precision	85.14	100	95.71	98.63	85.71	97.28
Recall	55.26	70.48	91.78	83.72	83.33	84.58
F1-Score	67.02	82.68	93.71	90.57	84.51	90.22

Table 8. Scenario 1: Performance measures (%) of TP10 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	89.32	12.47	80.29	98.35	0.031
	SVM	95.76	5.22	91.14	100.00
	KNN	89.17	11.69	80.94	97.40
Precision	LDA	96.93	5.94	91.86	100.00	0.102
	SVM	97.30	4.14	93.88	100.00
	KNN	97.28	4.89	93.27	100.00
Recall	LDA	86.34	10.26	78.10	94.58	0.045
	SVM	94.59	6.21	89.46	99.72
	KNN	84.58	11.33	75.20	93.96
F1-Score	LDA	90.78	8.73	83.81	97.75	0.022
	SVM	95.85	4.98	91.71	99.99
	KNN	90.22	9.36	82.54	97.90

Table 9. Scenario 2: Performance measures (%) of AF7 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	69.68	93.64	59.13	87.50	80.82	78.15
Precision	84.04	87.27	100	98.85	100	94.03
Recall	65.29	100	55.02	80.37	72.28	74.59
F1-Score	73.49	93.20	70.99	88.66	83.91	82.05
SVM
Accuracy	76.60	91.36	63.04	86.21	82.88	80.02
Precision	95.74	85.45	100	100	100	96.24
Recall	69.23	96.91	57.50	78.38	74.49	75.30
F1-Score	80.36	90.82	73.02	87.88	85.38	83.49
KNN
Accuracy	64.89	85.91	64.78	86.21	78.08	75.97
Precision	95.74	86.36	100	98.85	100	96.19
Recall	59.21	85.59	58.67	78.90	69.52	70.38
F1-Score	73.17	85.97	73.95	87.76	82.02	80.57

Table 10. Scenario 2: Performance measures (%) of AF7 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	78.15	13.45	64.70	91.60	0.048
	SVM	80.02	11.23	68.79	91.25
	KNN	75.97	10.56	65.41	86.53
Precision	LDA	94.03	5.32	88.71	99.35	0.087
	SVM	96.24	4.56	91.68	100.00
	KNN	96.19	4.89	91.30	100.00
Recall	LDA	74.59	14.31	60.28	88.90	0.032
	SVM	75.30	12.87	62.43	88.17
	KNN	70.38	13.72	56.66	84.10
F1-Score	LDA	82.05	8.94	73.11	90.99	0.041
	SVM	83.49	7.65	75.84	91.14
	KNN	80.57	9.31	71.26	89.88

Table 11. Scenario 2: Performance measures (%) of AF8 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	97.87	100	82.61	76.44	58.90	83.16
Precision	100	100	93.04	88.51	100	96.31
Recall	95.92	100	76.98	71.30	54.89	79.82
F1-Score	97.92	100	84.25	78.97	70.87	86.40
SVM
Accuracy	97.87	100	81.74	74.14	56.85	82.12
Precision	100	100	92.17	85.06	100	95.45
Recall	95.92	100	76.26	69.81	53.68	79.13
F1-Score	97.92	100	83.46	76.68	69.86	85.58
KNN
Accuracy	96.81	100	78.70	68.97	58.22	80.54
Precision	98.94	100	93.04	86.21	100	95.64
Recall	94.90	100	72.30	64.10	54.48	77.16
F1-Score	96.88	100	81.37	73.53	70.53	84.46

Table 12. Scenario 2: Performance measures (%) of AF8 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	83.16	14.75	70.34	95.98	0.042
	SVM	82.12	17.08	67.32	96.92
	KNN	80.54	14.21	68.29	92.79
Precision	LDA	96.31	3.79	93.19	99.43	0.089
	SVM	95.45	5.72	90.87	100.00
	KNN	95.64	4.89	91.63	99.65
Recall	LDA	79.82	16.24	66.40	93.24	0.038
	SVM	79.13	17.53	64.32	93.94
	KNN	77.16	16.99	62.81	91.51
F1-Score	LDA	86.40	11.73	76.65	96.15	0.047
	SVM	85.58	12.81	74.86	96.30
	KNN	84.46	13.29	73.52	95.40

Table 13. Scenario 2: Performance measures (%) of TP9 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	91.35	96.36	82.61	47.70	74.66	78.54
Precision	92.31	92.73	92.17	83.91	95.89	91.40
Recall	90.32	100	77.37	48.67	67.31	76.73
F1-Score	91.30	96.23	84.13	61.60	79.10	82.47
SVM
Accuracy	82.98	92.73	57.83	62.07	76.71	74.46
Precision	94.68	86.36	90.43	90.80	93.15	91.08
Recall	76.72	98.96	54.74	57.66	70.10	71.64
F1-Score	84.76	92.23	68.20	70.54	80.00	79.15
KNN
Accuracy	85.64	97.73	87.83	59.77	68.49	79.89
Precision	94.68	95.45	93.04	93.10	98.63	94.98
Recall	80.18	100	84.25	55.86	61.54	76.37
F1-Score	86.83	97.67	88.43	69.83	75.79	83.71

Table 14. Scenario 2: Performance measures (%) of TP9 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	78.54	18.65	63.42	93.66	0.042
	SVM	74.46	14.21	63.16	85.76
	KNN	79.89	15.78	67.28	92.50
Precision	LDA	91.40	5.89	86.73	96.07	0.124
	SVM	91.08	5.32	86.87	95.29
	KNN	94.98	3.67	92.01	97.95
Recall	LDA	76.73	16.67	63.80	89.66	0.038
	SVM	71.64	15.52	59.74	83.54
	KNN	76.37	15.88	64.35	88.39
F1-Score	LDA	82.47	12.43	72.95	91.99	0.049
	SVM	79.15	11.76	70.10	88.20
	KNN	83.71	11.89	74.44	92.98

Table 15. Scenario 2: Performance measures (%) of TP10 electrode.

Results	Partic. 1	Partic. 2	Partic. 3	Partic. 4	Partic. 5	Average
LDA
Accuracy	75.53	97.73	74.35	89.08	83.56	84.05
Precision	89.36	98.18	99.13	86.21	83.56	91.29
Recall	70.00	97.30	66.28	91.46	83.56	81.72
F1-Score	78.50	97.74	79.44	88.76	83.56	85.60
SVM
Accuracy	69.68	98.18	76.52	88.51	86.99	83.98
Precision	95.74	98.18	98.26	82.76	94.52	93.89
Recall	62.94	98.18	68.48	93.51	82.14	81.05
F1-Score	75.95	98.18	80.71	87.80	87.90	86.11
KNN
Accuracy	62.77	99.55	80.87	90.80	85.62	83.92
Precision	95.74	100	95.65	90.80	98.63	96.16
Recall	57.69	99.10	73.83	90.80	78.26	79.94
F1-Score	72.00	99.55	83.33	90.80	87.27	86.59

Table 16. Scenario 2: Performance measures (%) of TP10 electrode with confidence intervals and ANOVA results.

Metric	Classifier	Average	Standard Deviation	95% CI Lower Bound	95% CI Upper Bound	ANOVA p-Value
Accuracy	LDA	84.05	9.67	74.38	93.72	0.018
	SVM	83.98	10.03	73.95	94.01
	KNN	83.92	14.56	69.36	98.48
Precision	LDA	91.29	5.76	85.53	97.05	0.022
	SVM	93.89	5.01	88.88	98.90
	KNN	96.16	2.98	93.18	99.14
Recall	LDA	81.72	9.44	72.28	91.16	0.041
	SVM	81.05	10.76	70.29	91.81
	KNN	79.94	13.31	66.63	93.25
F1-Score	LDA	85.60	6.83	78.77	92.43	0.029
	SVM	86.11	5.38	80.73	91.49
	KNN	86.59	7.67	78.92	94.26

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

