1. Introduction
The mental state, stress levels, work capacity, as well as the timing and methods of rest, are increasingly recognized as significant individual factors. Stress and compromised mental health may lead to physical illnesses, depression, loss of concentration and productivity, and strained social relationships. Consequently, proper rest and effective time management are crucial for maintaining quality of life and overall well-being. In recent years, systems designed to assess current psychological states have gained broader application: they support the diagnosis and monitoring of mental disorders (e.g., depression and anxiety), enable organizations to evaluate employee stress and psychological stability, and assist in identifying students’ needs to ensure that educators can offer appropriate support. Mental stress assessment has become increasingly relevant across industry, healthcare, education, security, and sports, where understanding how the brain responds to varying types and durations of stress is a shared priority. To investigate mental activity, multimodal systems are frequently employed to measure diverse parameters during cognitive task performance [
1]. Key physiological markers—including EEG, pulse and heart activity, oxygen saturation in the blood, and related autonomic measures—are utilized, as their variations can aid in evaluating stress factors and predicting their influence on the current mental state. The assessment of stress offers opportunities to prevent undesirable workplace situations and to enhance work and learning processes [
2].
Brain–computer communication systems have been developed for diverse scientific and medical applications, including restoring quality of life for individuals with motor disorders or neuromuscular injuries and enhancing performance in professional activities. Brain–computer interfaces (BCIs) based on electroencephalography (EEG) offer a non-invasive means of capturing brain activity through scalp electrodes. EEG is considered safe and provides high temporal resolution, making it suitable for real-time applications; it is extensively employed in BCI systems due to its cost-effectiveness, portability, and sensitivity to rapid changes in neuronal activity [
3]. To enhance performance and reliability, multimodal BCIs increasingly integrate EEG with additional physiological signals. In complex and dynamic environments, research has utilized combinations such as functional near-infrared spectroscopy (fNIRS), EEG, electrocardiography (ECG), and galvanic skin response (GSR) to examine stress and cognitive load [
4]. The integration of EEG with GSR (also known as electrodermal activity, EDA) constitutes a well-established approach for capturing both neural and autonomic activity; GSR reflects sympathetic modulation of skin conductance and correlates with arousal or stress responses that are largely outside conscious control. Extending EEG + GSR with infrared thermography (facial surface temperature) further broadens multimodal monitoring by incorporating a non-contact measure associated with autonomic regulation and peripheral perfusion. Such parallel sensing offers complementary physiological evidence that may enhance the assessment of cognitive and emotional states and increase robustness amidst real-world noise and individual variability.
While the motivation for multimodal stress and workload monitoring is strong, the recent literature shows that the field is methodologically diverse and often difficult to compare across studies. A “fusion-centric” systematic review of EEG-based stress research (covering studies up to early 2025) emphasizes the wide variability in pipelines, feature sets, fusion strategies, and validation practices, and repeatedly notes challenges in standardization and reproducibility [
5].
A complementary systematic review concerning mental workload in human–robot collaboration (HRC) similarly reports considerable heterogeneity even within a single application domain: EEG, cardiac measures, and EDA are frequently employed; ocular measures (such as fixations, saccades, pupil size, and blink metrics) have gained increasing prominence; and temperature measures continue to be relatively underutilized [
6]. These syntheses imply that advancement is contingent not solely upon the development of “better classifiers,” but also on more precise construct definitions (clarifying what is being measured), improved ground-truth methodologies, and more uniform reporting practices concerning synchronization, preprocessing, and quality assurance.
Evidence from applied domains demonstrates both the potential and the limitations of multimodality. In remote assistance for highly automated vehicles, time-resolved indicators such as eye metrics (e.g., pupil dilation), tonic electrodermal activity, and cardiac measures are responsive to workload variations through auditory n-back tasks; however, the accuracy of person-independent multi-class classification remains modest at approximately 58% for three classes, thereby emphasizing a persistent gap in generalization between within-person detectability and cross-person application [
7]. A related driving study employing auditory n-back during simulated driving integrates EEG spectral power, EOG/eye measures, ECG-derived HRV, EDA features, driving performance metrics, and NASA-TLX, reporting systematic subjective workload increases alongside modality-specific sensitivities to task difficulty [
8]. It is important to note that multimodality is not inherently advantageous: a comparative study utilizing a driving simulator has demonstrated that eye-gaze/pupillometry can surpass other modalities. Furthermore, the fusion of modalities may not enhance, and can even diminish, the performance of the most effective unimodal baseline. This suggests that supplementary sensors may introduce artifacts and misaligned temporal structures without providing additional valuable information [
9]. Conversely, a study on web-browsing workloads demonstrates that using a comprehensive multimodal set—comprising EDA, ECG, PPG, EEG, temperature, and pupil dilation—can achieve high-accuracy differentiation of workload levels, provided that protocol constraints and labeling are consistent with ocular responses under controlled lighting conditions [
10]. Collectively, these results suggest that the justification for fusion should be based on empirical evidence through ablation and complementarity analysis, rather than being assumed to be beneficial by default.
Despite the proliferation of available datasets, fundamental issues of labeling inconsistency and limited comparability remain largely unresolved, constraining the progress they are meant to enable. The ADABase dataset is a comprehensive multimodal dataset of cognitive load in autonomous driving paradigms. It integrates physiological and behavioral data such as ECG, EDA, EMG, PPG, respiration, skin temperature, eye tracking, and facial video/action units. Additionally, it includes subjective workload assessments (NASA-TLX), performance results, and a biochemical marker (salivary cortisol) [
11]. Its broad scope enables systematic examination of multimodal fusion strategies, missing data patterns, and distribution shifts across tasks. However, it also highlights a common challenge: workload “targets” are frequently composite constructs that combine subjective appraisal, performance, and physiological reactivity [
11]. The MMD-MSD dataset, collected in an office-like environment, combines synchronized vision with wearables (Empatica E4: BVP, EDA, skin temperature, accelerometry) and structured tasks like Stroop and posture changes, facilitating analysis that connects cognition, behavior, and ergonomics [
12]. These resources are valuable not only for benchmarking models but also for clarifying practices related to synchronization, annotation, and quality control, which directly impact reproducibility and generalization.
Recent modeling efforts focus more on short time frames and near-real-time inference. One method converts peripheral biosignals, such as ECG and EDA, into image-like formats, uses copula-based Granger causality graphs to model dependencies between signals, and employs a lightweight capsule network classifier. This approach demonstrates excellent results even with very brief windows, approximately 1 s, on standard datasets [
13]. While these results are promising, they also highlight a translational gap: methods validated on curated benchmark datasets do not necessarily apply to smaller, noisier, more diverse experiments unless sensor quality, missing data, and individual differences are explicitly addressed in both modeling and evaluation.
Within industrial HRC, neuroergonomics studies reveal both potential and complexity in integrating central and peripheral measures. A smart-factory protocol alters task difficulty (such as a Stroop-like primary task with a secondary task), robot speed, and payload, recording EEG band-power features and fNIRS hemodynamics. It also includes NASA-TLX and behavioral performance measures to determine if physiological indicators can estimate or replace traditional subjective and performance outcomes [
14]. This approach aligns with the HRC review’s focus on enhancing standardization and combining various evidence types to boost interpretability and practical deployment.
An additional interpretive issue is that physiological changes attributed to “workload” might be influenced by affective appraisal. A large driving simulator study connects high-risk scenarios to anger, fear, and anxiety and breaks down workload into visual, auditory, cognitive, and psychomotor components, showing that emotional intensity can vary together with workload indicators and thus complicate establishing ground truth and comparing results across studies [
15]. This is important whenever stress, arousal, and cognitive demand occur together rather than separately in experiments.
Building on this broader psychophysiological motivation, foundational BCI research has shown that non-invasive EEG can enable α/θ communication and control, serving as essential methodological components and validation methods for modern mental-state monitoring [
16]. Community benchmarks like the BCI Competition III tested alternative approaches under similar conditions, highlighting the importance of standardized evaluation for meaningful progress [
17]. Early successes with EEG-based control in non-invasive BCIs demonstrated the feasibility of translating brain signals into continuous control variables [
18], while later work on machine learning for real-time, single-trial EEG analysis connected traditional BCI goals with mental-state monitoring needs [
19]. Asynchronous paradigms—such as EEG-based virtual keyboards—underscored the importance of accurate state detection without fixed trial limits [
20]. Other asynchronous BCI uses, like brain-actuated wheelchairs and virtual environment wheelchair control, focused on robustness and continuous operation in realistic settings [
21,
22]. Ultimately, BCI control of robotic and prosthetic devices integrated these advances into a wider range of applications, supporting multimodal systems that combine EEG with autonomic measures for practical, real-world monitoring [
23].
Against this background, the present study is designed as a controlled, within-subject study to determine whether brief episodes of cognitive stress (mental workload) cause measurable physiological changes within the same individuals and whether tasks with different cognitive demands can be distinguished. A sequential within-subject protocol is used, consisting of five phases (rest 1, Stroop, rest 2, subtraction, rest 3), while five modalities are recorded simultaneously: EEG, heart rate (HR), galvanic skin response (GSR), facial surface temperature, and oxygen saturation (SpO2). In addition to phase-by-phase examination of time-series data, inter-individual similarity is analyzed using correlation coefficients, showing that correlations across participants can be high for some modalities but may also be negative, reflecting significant individual differences in baseline levels and response patterns. For inferential purposes, phase-based epoch means are assessed using one-factor repeated-measures ANOVA within an omnibus hypothesis framework: H0 assumes equal modality means across phases, while H1 suggests that at least one phase mean differs, with significance indicating a phase effect without specifying which phases differ unless tested with planned contrasts.
This positioning directly addresses several recurring gaps identified in recent reviews, datasets, and applied research: (i) it highlights within-subject detectability within a unified, sequential timing framework; (ii) it considers inter-individual variability as an explicit focus rather than leftover noise; and (iii) it includes under-characterized but practical channels—most notably facial thermography (facial surface temperature) and SpO2—which have not yet shown consistent incremental value beyond HR and GSR in the literature.
The study examines whether brief episodes of cognitive stress—defined as mental workload—cause measurable, systematic changes in physiological responses within the same individuals, consistent with standard psychophysiological methods for studying attention, emotion regulation, and stress reactivity. A multimodal, non-invasive sensing setup is used, allowing simultaneous recordings of EEG, heart rate (HR), galvanic skin response (GSR), facial thermography (facial surface temperature), and oxygen saturation (SpO2). The main goal is to see if these measures can distinguish resting periods from cognitive load periods, and if the overall multimodal response profile is sensitive to differences between the two task types (Stroop versus mental subtraction). For statistical testing, an omnibus repeated-measures design is applied separately to each modality across five sequential phases (rest 1, Stroop, rest 2, subtraction, rest 3): the null hypothesis (H0) states that phase means are equal, while the alternative hypothesis (H1) suggests at least one phase mean differs. As a comprehensive (omnibus) test, statistical significance indicates a phase effect but does not specify which phase pairs differ unless planned contrasts are used.
While extensive research has focused on multimodal workload and stress monitoring, studies that simultaneously combine EEG, HR, EDA/GSR, facial infrared thermography, and SpO2 within a single repeated-measures protocol—and that align different signals on consistent analysis intervals—remain relatively scarce. Therefore, this study acts as a pilot within-subject investigation, employing a unified timing scheme (rest 1–Stroop–rest 2–subtraction–rest 3) to analyze phase effects statistically and assess individual differences, rather than aiming to develop a universally applicable classifier.
3. Results
EEG analytical results, concerning averaged entropy H’ statistics are presented in
Table 1 and averaged alpha/theta ratios statistics in
Table 2. Additional generalized findings are organized into three subsections for clarity: (i) phase-wise descriptive statistics (mean ± SD) summarized in
Table 3, (ii) inferential effects within subjects using repeated-measures ANOVA with effect sizes and Greenhouse–Geisser correction, as shown in
Table 4, and (iii) exploratory comparison of cross-participant similarity through correlation ranges to highlight inter-individual differences. In participant-level plots, each curve represents one participant. If a participant’s data is missing in a figure, the caption specifies the number of included participants and provides the reason (e.g., sensor dropout or unusable segment after artifact screening) to clarify data availability and selection.
3.1. Facial Surface Temperature
Throughout the study period, the similarity in facial surface temperature between participants showed significant variation: the lowest pairwise correlation coefficient was −0.63, the average correlation was around −0.01 (indicating almost no overall agreement), and the highest correlation reached 0.83. This reflected clear differences between individuals in baseline levels and how they responded.
Figure 3 shows the values of the facial surface temperature of a large proportion of the participants for the entire period of the study.
The surface temperature of the face follows a behavior that is partially repeated as a type in different participants. It is clearly seen that at the end of the first break, the temperature decreased in most measurements. Peaks are also seen during mental workload during both tests (between the 7th and 13th measurements and between the 17th and 23rd). There are exceptions, of course, but there are noticeable trends for facial temperature to increase during mental exertion and decrease during rest. It can be seen that the speed of these temperature variations is different for different participants.
3.2. Heart Rate
Throughout the entire study, similarity in heart rate (HR) between participants varied significantly: the lowest pairwise correlation coefficient was −0.58, the average is 0.38, and the highest reached 0.88. This indicates notable differences among individuals in baseline levels and response patterns, even though some participant pairs show moderately consistent trends.
Figure 4 shows the heart rate values of almost all participants for the entire study period.
This measurement also clearly shows the increase in heart rate during the tests that require mental activity and the correspondingly lower values during the breaks. Again, there are differences, which are mainly visualized as different peak heights in the different participants. Measurement number 7 is the last of the first resting period. During the first test, peaks in the heart rate of some of the participants are clearly visible (measurements with numbers 8, 9, 12 and 13). During the second rest, drops in the heart rate levels of most participants are clearly visible (measurements with numbers 14, 15 and 16). Subsequent measurements with numbers from 18 to 23, which are during the second test and in the zone of the second mental load, again clearly show the increase in the heart rate levels and are visible on the diagram as peaks of different heights. At the end of the experiment during the last rest (measurements with numbers from 24 to 29), the diagram clearly shows the decrease in the heart rate values, which corresponds to the rest state. At the end of the experiment, the heart rate values fell to the relative values of heart rate at rest.
3.3. Oxygen Saturation
Across the full study period, cross-participant similarity in oxygen saturation (SpO2) showed wide dispersion: the minimum pairwise correlation coefficient reaches −0.74, the average correlation was approximately 0.02 (i.e., near zero overall agreement), and the maximum correlation reached 0.77, indicating substantial inter-individual differences in baseline levels and response dynamics.
Figure 5 shows the oxygen saturation values of most participants for the entire study period.
Oxygen saturation measurements show relatively stable values. Some of the diagrams show small drops during periods of mental stress, but the stable behavior of this value indicates its low informativeness in this type of research.
3.4. Galvanic Skin Response
Throughout the study, cross-participant similarity in galvanic skin response (GSR) showed significant variability, with pairwise correlation coefficients ranging from −0.58 to 0.90 and an average of 0.26. These values indicated substantial differences between individuals in baseline conductance and response patterns, and they were mainly reported as descriptive context rather than as evidence of a strong group-level effect.
Figure 6 visualizes the values of the galvanic skin response of all study participants.
3.5. Electroencephalography (EEG)
Additional correlation coefficients r* have been calculated [
27] for both epochs, regarding the E’ entropy. The aggregated findings concerning these analyses are depicted in
Figure 7 and
Figure 8, as follows:
As it is clear from the averaged entropy H’ results for both rests and cog. loads, the rests episodes are somewhat oscillating with all brain areas, giving higher values after each cognitive load episode. These dynamic changes are clearly observed for both parietal and occipital brain areas, whilst giving slight domination to the left hemisphere leads (P3, O1) vs. right ones (P4, O2). The lowest results are observed with central–right leads—C4, and both frontal and central left ones are showing higher entropy values.
The situation concerning both cog. loads’ episodes (load 1 and load 2) are giving almost equal dynamics for all recording leads, whilst giving some dominance to the left hemisphere (C3, P3 and O1 leads) in comparison to the right ones for the load 2 episode. It should also be noted here that the measured differences are not so high (in the interval of 0.2–0.8 from the overall H’ entropy dynamics).
As for the correlations’ r* dynamics, the aggregated results demonstrate a frontal dominant correlation, concerning the fronto-central zones during cognitive loads identification. In comparison, the parietal correlations, though still significant, are giving lower results for the right hemisphere. Similar results are also observed in the occipital zones. Regarding the rests’ parts of the experiment, the results are more dynamic, giving aggregated and synchronized responses to the frontal but left hemisphere (for F3 leads), whilst the right ones of the frontal and central zones demonstrate lower correlations (for leads F4 and C4), also adding the same response for C3 leads. In addition to these findings, the lowest correlations are also observed with the left parietal leads of P3 and partially occipital ones with O1. The rest of the responses from P4 and O2 are giving much better correlation performance. The most significant results are observed between the first two rests, and less between the second and third ones.
To complement the primary analysis and enable meaningful pairwise comparisons despite the limited sample size, additional statistical tests were applied to the averaged entropy results (H’). This includes the Mann–Whitney U test, Wilcoxon signed-rank test, paired permutation test, and two summary indices—mean and Combined Tendency Score (CTS)—as reported in
Table 1. These tests were included for completeness; however, the non-stationary nature of EEG signals limits the interpretability of entropy-based statistics when considered in isolation.
The findings presented above do not demonstrate a sufficiently clear discrimination between rest and cognitive load episodes, which is consistent with the well-established complexity of EEG signals [
29] and their sensitivity to emotional context [
31]. Nevertheless, the statistical tests reveal a moderate initial effect for the first cognitive load episode (load 1), interpretable as the subjects’ initial experimental activation, characterized statistically as a “moderate upward tendency.” This tendency persists following rest 2 and becomes most prominent in the third episode pair (load 2 and rest 3), with both episodes sustaining the upward trend.
As these results show some oscillation and are constrained by reliance on the alpha frequency band alone, the analysis was extended to incorporate the theta band. Following the approach of [
36], averaged alpha/theta ratios (ρ’ α/θ) were computed for all rest and cognitive load episodes and are presented in
Figure 9.
The obtained results have shown a clear discrimination capability within all EEG leads and experiment rests and cog. loads episodes, providing almost doubled averaged ratios, ρ’, between them, whilst keeping strong statistical significance and upward tendency.
3.6. Cross-Modal Interpretation of Phase Effects
To maintain a balanced within-subject design for testing, only participants with complete data across all five phases were included, resulting in nine balanced subjects. Because raw GSR values can vary greatly between individuals in baseline and dynamic ranges, the analysis also employed a processed GSR measure to improve comparability across subjects, based on subject-wise min–max normalization and baseline referencing to rest 1 (which becomes roughly zero by design). An optional smoothing step was added but was effectively turned off in the reported run (window length = 1).
The statistical analysis was executed using a standalone Python 3.12.2 script. For each modality (HR, GSR_rel, facial temperature, and SpO2), the script fits a one-factor RM-ANOVA model with phase as the within-subject factor (five levels: rest 1, Stroop, rest 2, subtraction, rest 3) and subject as the repeated-measures identifier. The output includes the F-statistic with its degrees of freedom, the uncorrected p-value, partial eta-squared as an effect size, Greenhouse–Geisser epsilon (εGG), and a GG-corrected p-value as a conservative approach for inference. The calculations are performed using stats models with AnovaRM.
In the RM-ANOVA, the inferential question is posed as a comprehensive test to determine whether the experimental phase has any consistent effect on a specific physiological variable. For each modality (HR, processed GSR, facial temperature, and SpO2), the same pair of hypotheses is tested:
Null hypothesis (H0): The mean value of the modality remains the same across all five phases (rest 1, Stroop, rest 2, subtraction, rest 3). Essentially, H0 suggests that any differences observed between phases are due to random variation rather than the experimental procedure.
Alternative hypothesis (H1): At least one phase has a mean value that differs from the others. Importantly, H1 does not specify which phase differs nor the direction of the difference.
As an omnibus test, significance indicates a phase effect without specifying which phase pairs differ. It is important to note that the reported standard deviations represent between-subject variability of the phase-wise epoch means (i.e., differences in baseline levels and individual reactivity), rather than measurement noise within epochs (
Table 3).
Regarding heart rate (HR), the average level increased during the cognitive load epochs compared to rest. The mean HR elevated from 77.841 ± 11.777 bpm at rest 1 to 80.630 ± 14.538 bpm during the Stroop task, returned close to baseline during rest 2 (77.889 ± 14.136 bpm), and reached its peak during the subtraction task (83.926 ± 14.532 bpm). The subsequent resting period (rest 3) reverted to values near the initial resting state (77.167 ± 13.037 bpm). Although mean differences between phases are modest relative to the relatively large standard deviations (indicating substantial baseline variability among participants), the overall pattern is consistent with increased cardiac activation during subtraction in this sample.
Regarding the processed galvanic skin response measure (GSR_rel), the phase-wise averages remained highly variable and did not exhibit a stable monotonic trend across phases. Rest 1 was approximately zero (~0 ± ~0), as the GSR metric is subject-normalized and baseline-referenced to rest 1. Relative to this baseline, the mean values were 0.0144 ± 0.1990 for Stroop, −0.0324 ± 0.2310 for rest 2, −0.2328 ± 0.4416 for subtraction, and 0.0442 ± 0.3982 for rest 3. The standard deviations were large relative to the means, indicating substantial heterogeneity in electrodermal responses across individuals and/or variability in the direction and magnitude of reactivity under cognitive load.
Facial temperature exhibited only modest fluctuations between phases, with mean values remaining confined within a narrow range. The average temperature was 35.270 ± 1.140 °C at rest 2 1, 35.296 ± 0.902 °C during Stroop, 35.447 ± 0.848 °C at rest 3 2, 35.322 ± 0.817 °C during subtraction, and 35.465 ± 0.902 °C at rest 3. Although rest 2 and rest 3 display slightly higher mean temperatures compared to rest 1, these differences are minimal relative to the between-subject standard deviations, consistent with slower thermal dynamics and prominent individual differences over the brief epochs examined.
Oxygen saturation (SpO2) demonstrates minimal fluctuations across phases. The mean SpO2 values were 96.095 ± 1.409% at rest 1 1, 96.259 ± 1.696% during Stroop, 95.861 ± 1.282% at rest 2, 95.981 ± 1.324% during subtraction, and 95.333 ± 1.074% at rest 3. Variability was comparatively low; however, mean differences across phases were also very small (less than one percentage point), supporting the interpretation of SpO2 as predominantly stable during short cognitive load tasks in healthy individuals.
The RM-ANOVA results quantify whether these apparent differences are statistically significant (
Table 4).
For heart rate, the phase effect was strong: F(4, 32) = 10.5862, p = 1.4037 × 10−5 (uncorrected), with a large effect size (ηp2 = 0.5696). The Greenhouse–Geisser correction yields εGG = 0.2616 and a corrected p = 0.01044, confirming the significance of the phase effect. This indicates that HR systematically varies across the five phases.
For the processed GSR_rel, the overall phase effect was not significant: F(4, 32) = 2.1798, p = 0.09368 (uncorrected), ηp2 = 0.2141, εGG = 0.3437, and pGG = 0.1661. Despite a moderate effect size, the sample size and variability do not support rejecting the null hypothesis that phase means are equal for this measure.
Facial temperature shows no evidence of a phase effect: F(4, 32) = 0.6532, p = 0.6289 (uncorrected), ηp2 = 0.07549, εGG = 0.3010, and pGG = 0.4665. These results suggest that temperature either changes slowly or is heavily influenced by individual differences during short epochs, rather than displaying sharp, phase-locked shifts.
For SpO2, the uncorrected test is marginal (F(4, 32) = 2.7331, p = 0.04605, ηp2 = 0.2546), but the Greenhouse–Geisser correction reduces the significance (εGG = 0.3382, pGG = 0.1209). Under a conservative approach, this indicates no robust phase-related modulation of SpO2, and any observed differences should be interpreted with caution.
Using a conservative inference rule based on GG correction, the overall conclusion is that a statistically significant phase effect is only present for heart rate, while no significant omnibus phase effects are observed for GSR_rel, facial temperature, or SpO2.
4. Discussion
Considering the pilot nature of this study with a small within-subject sample, the interpretations are intentionally conservative and based solely on empirically supported data from the current dataset. Effects related to phases are analyzed using a repeated-measures approach, while modalities beyond the primary autonomic measures are considered secondary or exploratory, due to notable individual differences and modality-specific responses during short epochs. Therefore, strong assertions about “reliability,” “validation,” or specific neurophysiological mechanisms are avoided unless backed by formal statistical evidence. The discussion instead highlights observed trends, variability, and how consistently effects appear under cautious inference.
The analysis of the collected physiological data (facial temperature, heart rate, oxygen saturation, and galvanic skin response) across ten participants reveals distinct patterns and inter-individual variability in response to cognitive load. The following observations, derived from the dataset, complement the EEG findings and provide a broader understanding of the physiological correlates of mental workload.
The findings presented here may be tentatively interpreted as reflecting a sustained neural response to cognitive engagement. Cognitive load episodes appear associated with increased EEG theta band activity, while rest states are characterized predominantly by increased alpha band activity—a pattern consistent with established accounts of task-related cortical dynamics. The observed dynamics of averaged entropy H’ can be interpreted within the same framework, as shifts toward higher frequencies during load episodes are known to correspond to increased signal entropy.
It should be noted that the analysis was intentionally restricted to the alpha and theta bands as simple, reliable, and statistically interpretable metrics for discriminating eyes closed rest states from eyes open cognitive load states. The remaining EEG frequency bands—beta and gamma—were not examined here but represent a natural avenue for deeper investigation of the underlying neural mechanisms driving the observed dynamic changes.
GSR time series displayed varied, participant-specific fluctuations during rest and task periods. Changes in amplitude varied significantly among individuals in both direction and size. Correlation analysis also showed wide variability, with pairwise coefficients ranging from −0.58 to 0.90 and an average of 0.26, indicating limited overall agreement and high dependence on each person’s baseline conductance and response patterns. Importantly, conservative rmANOVA analysis found no statistically significant overall phase effect for GSR. Therefore, GSR was considered an exploratory channel in this dataset and not viewed as a reliable group-level measure of cognitive load.
Heart rate (HR) shows a reliable increase during cognitive tasks.
A clear and consistent trend of increased heart rate was observed during both cognitive tasks (Stroop and arithmetic) compared to the rest periods. The data shows an average increase of approximately 5–15 BPM from resting baseline to the peak of cognitive effort.
The correlation coefficients for HR between participants were positive on average (0.38), with a maximum of 0.88. This suggests a more uniform directional response (increase during load) across the cohort compared to other measures like temperature or SpO2.
Observation: HR offers the clearest group-level evidence of phase-related modulation in this dataset, consistent with the expected cardiovascular response to cognitive stress.
Facial surface temperature demonstrates complex and biphasic patterns.
Contrary to a simple uniform increase, facial temperature dynamics were more complex. A common pattern observed in multiple participants was an initial decrease in temperature at the onset of a cognitive task, followed by a subsequent rise as the task continued.
The correlation between participants was the lowest on average (−0.01), with a wide range from −0.63 to 0.83. This indicates that while some participants showed strong correlated patterns (e.g., a clear temperature rise with load), others showed inverse or uncorrelated responses, making it a less reliable standalone metric without individual baseline.
Facial thermography detects dynamic physiological changes, but its response varies greatly among individuals in this protocol and should be interpreted cautiously when analyzing group data. The biphasic pattern (initial drop followed by an increase) may reflect rapid vasoconstriction followed by increased cerebral blood flow and metabolic heat.
Oxygen saturation (SpO2) shows minor fluctuations with low inter-participant consistency. The SpO2 levels remained within the normal physiological range (95–99%) in all participants throughout the experiment. While subtle decreases (1–2%) were sometimes recorded during cognitive tasks, these changes were not consistent across all subjects or all task periods.
This is supported by the correlation data, which shows an average correlation near zero (0.02) between participants, indicating no consistent, synchronized pattern of SpO2 change in response to the cognitive loads in this study.
Peripheral oxygen saturation appeared largely stable during short cognitive load epochs in this context and did not demonstrate significant phase-related modulation at the group level.
High inter-participant variability underscores the need for personalized assessment.
The calculated correlation matrices for all non-EEG parameters consistently show a wide spread of correlation coefficients (from strongly negative to strongly positive) for every possible pair of participants.
This finding strongly suggests that while the direction of change for parameters like HR is often consistent (e.g., HR generally goes up), the specific pattern, magnitude, and timing of each individual’s physiological response is unique.
The significant inter-participant variability across measured channels highlights challenges for “one-size-fits-all” interpretation and supports the use of within-subject designs and conservative inference in small-sample multimodal studies.
The RM-ANOVA results quantify phase-related modulation in the peripheral physiological measures within a rigorous within-subject framework. This methodology is highly appropriate for the current protocol, as it explicitly accounts for repeated measurements from the same individuals and separates systematic phase effects from substantial inter-individual differences in baseline levels and overall variability. In this dataset, the standard deviations of the epoch means indicate that between-subject dispersion is often substantial relative to phase-to-phase shifts in the means; thus, RM-ANOVA offers a principled method for testing whether the observed changes are consistent within participants across phases, rather than being influenced primarily by baseline differences among participants. A significant omnibus F-test supports the conclusion that at least one phase differs significantly in mean value, while a non-significant result suggests that any apparent differences are relatively small compared to within-subject consistency and residual variability at the group level. EEG-derived findings are regarded solely as supplementary contextual evidence for task engagement and are not incorporated into the formal inferential claims presented in this section.
4.1. Summary of Principal Findings
Among the autonomic measures, heart rate (HR) demonstrated the clearest sensitivity to task demands; it increased during both cognitive loads and was more pronounced during the more difficult arithmetic task. Galvanic skin response (GSR) showed variable, subject-specific task-related changes but did not exhibit a statistically significant omnibus phase effect under conservative correction and was therefore interpreted as exploratory in this dataset. Facial temperature did not exhibit strong phase-locked shifts across the group and instead followed slower, individual-specific patterns during the protocol. SpO2 remained relatively stable, as expected during seated cognitive tasks, thus serving as a physiological baseline. EEG was summarized by alpha/theta ratios across eight leads per epoch; this central measure is theoretically expected to decrease with increased workload, providing a complementary perspective to the peripheral measures.
4.2. Fast vs. Slow Channels
The data correspond with established time scales. Fast changes include HR (beat-to-beat and short windows), phasic GSR (event-related peaks), and EEG alpha/theta, which can fluctuate within seconds of task start. Intermediate changes involve tonic GSR, showing gradual drifts over minutes. Slow variations are seen in facial temperature, which accumulates over minutes rather than reacting immediately to task events. Understanding these time scales is essential for selecting appropriate windowing methods—using event-locked windows for HR, GSR-phasic, and EEG, and epoch averages and trends for temperature.
4.3. On Integrating Heterogeneous Modalities
Multimodal integration is valuable but should be used sparingly. No single channel suffices for all participants and tasks; combining central (EEG) with peripheral (HR, GSR, temperature) signals improves robustness and provides convergent evidence (in this dataset) (e.g., HR increase and GSR-phasic increase alongside EEG α/θ decrease during load). However, integration should happen in stages:
Normalize each subject individually (e.g., z-score or min–max); for GSR, separate phasic from tonic components and normalize to rest 1.
Extract compact epoch-level features per modality (e.g., HR mean and HRV; GSR peak rate/amplitude and tonic level; facial temperature mean and change from previous rest; EEG α/θ per region).
Perform late fusion (weighted z-sum or a small multivariate model) after validating each modality independently.
Conduct cross-modal checks (repeated-measures correlations, cross-correlation for lead–lag).
Beyond the omnibus phase tests, modality-specific patterns are described as exploratory observations without proposing extra post hoc hypotheses.
4.4. Role of EEG Alpha/Theta
The averaged alpha/theta ratios were calculated across eight leads per epoch to assess cortical states. Typically, α/θ ratios decrease with increased workload. In this dataset, the alpha/theta metric serves two complementary purposes: (i) confirmatory, by testing whether α/θ (rest) is greater than α/θ (Stroop) and α/θ (rest 2) exceeds α/θ (subtraction), preferably using a within-subject mixed model; and (ii) convergent, by examining correlations between α/θ and HR and GSR across epochs, expecting negative correlations with HR and phasic GSR during task load. Even if power is limited at α/θ bands, directional trends enhance the interpretation of results across multiple modalities.
4.5. Correlation Structure and Lead–Lag
Simple epoch-wise correlations are useful but should be supplemented with repeated-measures correlation to prevent inflation caused by between-subject differences and lagged analyses, such as cross-correlation, to investigate if EEG changes occur before autonomic responses. With n = 10, such timing assessments should stay descriptive but can inform future study designs, like using shorter windows around task onsets.
4.6. Limitations
The sample size is small (ten participants), which limits the ability to detect small effects and restricts the complexity of multivariate models. Epoch lengths are inconsistent, and the slow dynamics of temperature can obscure task contrasts. Large between-subject variability in GSR requires normalization. These limitations led to the use of mixed models, conservative feature sets, per-subject scaling, and planned contrasts.
4.7. Implications and Future Work
The results propose a practical approach for non-invasive workload assessment: use HR as a quick, reliable marker; include normalized phasic GSR to detect higher loads; utilize EEG α/θ for central validation; consider facial temperature as a slow-changing context indicator; and keep SpO2 as a measure of stability. Future research should involve larger samples, standardize event-locked windows, and pre-register concise feature sets. With increased data, it becomes possible to analyze a latent workload factor that combines EEG and autonomic features using Bayesian or SEM methods, and to evaluate how well the model generalizes through cross-validated predictions.
In summary, multimodal sensing is justified even at small scales when each modality is processed on its natural time scale, normalized per subject, and interpreted within a confirmatory mixed-model framework. HR and normalized GSR indicate load (and task difficulty), EEG α/θ provides central convergence, facial temperature reflects slow drift, and SpO2 confirms physiological stability, together supporting the viability of a non-invasive, multimodal approach to current psychological state assessment.
5. Conclusions
The correlations discussed in this study mainly serve as a descriptive complement to the within-subject phase analyses. Correlation coefficients, whether between different modalities or between participants for the same modality, do not inherently demonstrate discriminative ability or causal relationships. Nonetheless, they are valuable for understanding two key aspects of multimodal, non-invasive BCI monitoring. Firstly, they reveal how strongly various physiological channels tend to co-vary under a shared protocol, which helps determine whether modalities offer overlapping or complementary information. Secondly, correlations across participants offer a succinct way to illustrate the variability in temporal response patterns among individuals, highlighting a major challenge in creating generalizable mental-state monitoring systems and emphasizing the need for cautious interpretation and personalized baselining.
Combined with the phase-wise within-subject results, these descriptive correlation patterns suggest the broader potential of non-invasive multimodal BCI systems to detect relevant psychophysiological changes during brief cognitive load periods. Although the current pilot data do not allow for definitive population-wide conclusions, they indicate that multimodal sensing may help assess mental state when used carefully and within suitable analytical methods. Gaining a better understanding of how physiological responses vary together and differ across individuals could improve monitoring of everyday stress and workload, which are known to impact well-being, focus, and productivity. Consequently, these findings offer a structured empirical foundation for future, larger-scale studies on psychological stress, using more standardized data collection and predefined outcomes.
The study not only focused on peripheral sensing but also included an EEG processing pipeline designed to offer additional central nervous system insights into task engagement. EEG data underwent bandpass and mains-notch filtering, with alpha (8–13 Hz) and theta (4–7 Hz) bands extracted to generate compact, window-level descriptors across the eight recorded leads. To mitigate the effects of sparse, non-stationary artifacts common in this pilot setup, wavelet denoising was applied. The cleaned signals were then summarized using entropy-based descriptors and alpha/theta ratios, serving as practical metrics to compare rest (eyes closed) and cognitive load (eyes open) epochs. These EEG summaries were used cautiously as supportive context rather than definitive evidence of discriminative performance.
For the primary within-subject analysis, phase-wise epoch means were tested using a one-factor repeated-measures ANOVA with a conservative Greenhouse–Geisser correction. Based on complete-case data (n = 9), heart rate was the only modality showing a statistically significant omnibus phase effect across the five phases. In contrast, baseline-referenced GSR, facial temperature, and SpO2 did not reach significance after correction. Therefore, the main conclusion from this pilot data was that HR demonstrated strong phase-related changes under cognitive load, while the other peripheral measures should be considered exploratory or context-dependent, as their apparent variations were small compared to individual differences.
In order to build successful practices for studying the influence of cognitive stress, as well as methodologies for overcoming stressful conditions, taking into account the difficulty of such studies and the many influencing factors, we would like to point out the need for additional research.