1. Introduction
Emotions shape our daily lives, influencing our behavior, perception and cognitive processes. As such, understanding and recognizing emotional responses is vital for advancing human research. Emotions are complex physiological and psychological responses to internal or external stimuli, producing coordinated changes in bodily states, cognitive processes, and expressive behavior [
1]. Facial expressions represent one of the most immediate and informative channels through which these emotional responses manifest [
2]. Automated facial expression recognition systems are traditionally grounded in the Facial Action Coding System (FACS) [
3], which links observable facial movements to underlying emotional processes. Modern implementations typically rely on image- [
4,
5,
6] or video-based algorithms [
7,
8,
9], to detect the facial action units (AU) associated with basic emotions such as happiness, fear, anger, and surprise. Although these approaches have achieved substantial progress, they present notable limitations, including privacy concerns related to video capture, sensitivity to lighting and occlusion, and reduced performance when facial expressions are subtle or partially obstructed [
10,
11,
12,
13].
To overcome these limitations, there is growing interest in physiological approaches to emotion detection. In particular, facial electromyography (fEMG) offers a way to measure the underlying muscle activation associated with expressive behavior [
14,
15,
16]. Because fEMG captures neuromuscular signals directly, it can detect affective responses even when visible facial features are partially or fully obscured and offers a privacy-preserving alternative to camera-based methods [
17]. Prior works [
18,
19,
20,
21,
22] have shown that fEMG signals exhibit high signal-to-noise ratios and correlate strongly with both spontaneous and intentional affective expressions, highlighting their potential for continuous emotion monitoring.
Immersive virtual reality (VR) is increasingly used to elicit an ecologically valid range of emotional responses [
23,
24,
25,
26,
27]. Through standardized and interactive scenarios, VR can evoke a broad range of context-dependent affective and workload-related states, including stress, cognitive effort, and temporal pressure, while enabling repeatable experimental conditions [
28,
29,
30,
31]. However, the occlusion of facial features by VR headsets limits the reliability of vision-based facial expression analysis, creating a need for alternative sensing modalities that remain robust during immersive interaction [
32].
To capitalize on VR’s advantages while addressing the limitations it imposes on vision-based emotion detection, combining fEMG sensors with VR is crucial, as it enables unobstructed measurement of facial muscle activity during immersive experiences. fEMG has been shown to reliably reflect affective valence and basic emotional responses by providing a direct measure of the muscle activity associated with expressive behavior, with studies demonstrating its robustness to subtle or masked expressions [
33,
34]. Early studies showed that activity in muscles such as the zygomaticus major reliably reflects affective valence and basic emotional responses [
35]. Later work expanded to multi-channel fEMG systems, extracting handcrafted time- and frequency-domain features for distinguishing discrete emotions [
36]. Despite its advantages, fEMG-based emotion detection faces several challenges. There is no standardized protocol for acquisition or analysis, resulting in variability across studies. Most existing work focuses on basic, discrete emotions and does not capture the nuanced affective states present in naturalistic tasks [
37]. fEMG signals also exhibit substantial inter-individual variability in muscle anatomy and activation patterns, necessitating personalized modeling approaches for reliable decoding. Contextual factors—such as task demands, environment, and cognitive load—further modulate facial muscle activity, complicating interpretation [
33]. Moreover, conventional modeling and analysis pipelines often rely on traditional feature-engineering approaches that struggle to capture how expressions evolve over time, especially during cognitively or emotionally dynamic VR tasks [
38].
These challenges are amplified in immersive VR environments, where emotional, cognitive, and physical demands evolve continuously, producing complex spatio-temporal activation patterns across facial muscles [
39,
40,
41]. Effective modeling of VR-based fEMG therefore requires analytical frameworks capable of capturing temporal dependencies, multi-muscle coordination, and individual variability. Spatio-temporal deep learning approaches offer a principled solution by enabling automatic extraction of meaningful patterns from high-dimensional physiological data while preserving temporal structure and inter-channel relationships.
Building on this framework, the National Research Council Canada (NRC) has previously developed bWell, an interactive and immersive VR platform designed as a broadly applicable toolkit delivering multisensory tasks, targeting general aspects of cognition and everyday functioning [
42,
43,
44]. This platform is currently being utilized within a larger program of research aimed at integrating VR with multimodal physiological data for systematic and individualized understanding of stress response. The comprehensive methodology for the development of this platform for this target application, including design, implementation and exercises, is described in detail in our previous publication [
45]. In a preliminary pilot study, multimodal data—including heart rate, fEMG, perceived stress, and demographic information—were used to develop supervised models predicting subjective task load during VR-based stress exposure [
46]. These findings identified facial expression-derived fEMG features among the strongest predictors of perceived workload, motivating the present study’s focused analysis of fEMG signals and exploratory development of an end-to-end processing and modeling pipeline grounded in FACS-informed interpretation.
In this work, we present a unified framework integrating immersive VR, multi-channel fEMG, and a Convolutional Neural Network–Temporal Convolutional Network (CNN–TCN) spatio-temporal deep learning architecture for expression modeling and workload-related analysis. Within this architecture, the CNN component captures spatial patterns and inter-channel relationships in physiologically normalized fEMG signals, while the TCN models temporal dependencies and the dynamic evolution of muscle activity over time. The approach leverages physiologically normalized signals and a shared model trained across participants to classify calibrated facial expressions from brief standardization trials. We then extend this framework to spontaneous, VR task-elicited fEMG recordings to infer continuous expression dynamics and extract static and temporal features characterizing expressive behavior across experimental scenes. Finally, we evaluate whether these fEMG-derived dynamics exhibit scene-dependent patterns and statistically robust associations with subjective workload ratings, while examining their consistency with FACS-aligned facial action patterns. The goal of this approach is not to generalize to a larger population but rather to create a model calibrated with data from a specific cohort, allowing for precise inference within that same group. We hypothesize that this approach acts similarly to a data encoding process, minimizing the need for a large cohort to develop a potentially broadly applicable model for predicting outcomes in new populations. Together, these contributions establish a scalable, privacy-preserving, and physiologically grounded foundation for automated expression monitoring and workload-related assessment in immersive VR environments.
2. Materials and Methods
In this study, participants are immersed in diverse virtual scenarios specifically designed to elicit a spectrum of emotions. These include basic emotion labels such as frustration and calm, as well as more nuanced labels that reflect the intersection of emotional and cognitive processes, such as determination, interest, anticipation, and task load. To effectively capture these emotional responses, this study utilizes both self-reports and continuous physiological measures. This data collection supports the development of a deep learning (DL) model.
A total of 12 participants were recruited for this study (6 females, 6 males). Female participants ranged in age from 29 to 45 years (M = 38.7, SD = 7.5), and male participants ranged from 25 to 56 years (M = 42.5, SD = 10.1). Participants were healthy volunteers with no known neurological disorder, head trauma, or other unstable medical condition that could adversely impact cognitive functioning, and no physical mobility issues limiting the use of VR. Personal characteristics data were obtained, including age, sex, handedness, gaming activity and VR experience. Participants were recruited based on their motivation to participate in the study. Before the experiment, all participants received a detailed explanation of the experimental protocols. Subsequently, they approved and signed a written consent form. Ethics approval for the study was obtained from the Research Ethics Board (REB) of the National Research Council Canada (NRC 2020 and NRC 2023-46) and from the Defence Research and Development Canada Human Research Ethics Committee (HREC 2022-032).
bWell has been designed with a suite of customizable tasks, allowing users to select tasks tailored to the unique needs of various fields or target populations. The platform has been adapted for specific applications, including multimodal assessment of stress response [
45,
46], cognitive remediation for depression [
47,
48], exploring memory across aging [
49], and integration with electroencephalography (EEG) [
50,
51]. The tasks used in the current study are briefly summarized in
Figure 1 and detailed as follows:
Tent—a nature scene used to collect data while the user is immersed in VR without performing a task. Used for the resting/baseline state and for recovery after a task.
City—passive viewing of a busy city with noise pollution—roads with ongoing traffic, including honking cars and sirens. Used to elicit an emotional response.
Mole—an exercise targeting response inhibition and cognitive control in which the user has a hammer in each hand and must hit cylinders that pop up in front of them (only when the cylinder and hammer are of the same color).
Stroll—an exercise administered as a purely physical demand (Stroll) and/or a dual task with both physical and cognitive demand (Stroll + CPT). To exert physical demand, the user must physically displace themselves to advance in the scene. The user was asked to jog, keeping their heart rate at a moderate level of exertion (the predefined threshold was displayed as a gauge based on individual heart rate). As a dual task, the exercise includes a secondary task targeting sustained attention with a continuous performance test (CPT) in which the user must press a button for each new shape that appears, except when it is a green diamond.
fEMG was collected using a commercially available system, the emteqPRO by Emteq Labs (Brighton, UK), which consists of a sensor mask insert for VR headsets [
52]. The mask is equipped with seven dry facial EMG electrodes (
Figure 2) to measure facial muscle activity, positioned at the (1) frontalis (eyebrow), (2) orbicularis (eye), (3) zygomaticus (mouth and cheek) and (4) corrugator (forehead). The signals are sampled at a frequency of 1000 Hz. In contrast to traditional camera-based systems, this technology can capture subtle facial movements, providing insights into a user’s affective state and enabling application in virtual reality. In this study, the emteqPRO system was integrated into the HTC Vive Pro Eye VR head-mounted display (HTC Corporation, New Taipei City, Taiwan).
The emteqPRO focuses on key muscles identified in the FACS, which provides a comprehensive framework for identifying facial muscles involved in various expressions [
3], summarized in
Table 1. Specifically, it targets muscles such as the zygomaticus major, which is primarily responsible for pulling the lip corners upward, contributing to smiling. It also monitors the corrugator supercilii muscle that is involved in frowning, as it draws the eyebrows downward and together, creating vertical lines between the brows. The frontalis muscle, responsible for elevating the eyebrows, is targeted to capture expressions of surprise or attentiveness. Additionally, the orbicularis oculi muscle, which encircles the eyes, is key for facilitating eye closure, often associated with genuine smiles (Duchenne smiles) [
54]. By capturing the activity of these muscles, the emteqPRO can effectively monitor a subset of emotions outlined in the FACS, such as happiness, sadness and surprise.
The study was designed to systematically elicit specific emotional, physical, cognitive, and dual task demands by engaging participants in a series of VR environments, as illustrated in
Figure 3. In alignment with recent related work [
55], the study protocol began with an initial calibration phase (included in the emteqPRO software, Emteq SDK V2) where participants were asked to intentionally make specific facial expressions. This phase aimed to generate a high-quality, precisely annotated dataset of facial expressions suitable for training a predictive model. The facial expressions used in this study were limited to four expressions (smile, frown, raised eyebrow, and neutral), dictated by the standardized emteqPRO calibration protocol [
52,
53], and employed as expression-level markers rather than direct representations of complex emotional states. Further, participant-specific calibration enabled robust normalization across individuals. During the calibration, participants followed onscreen instructions prompting them to hold distinct expressions—such as maximum smile, maximum frown, and eyebrow raise (surprise)—for a few seconds each, while maintaining a neutral expression in between. During the experimental testing phase, participants were exposed to a sequence of VR environments designed to elicit distinct physiological and psychological responses. The session started with a neutral nature scene where participants were instructed to remain stationary, standing with their eyes open. This resting condition served to establish a 5 min baseline measurement of heart rate variability (HRV), consistent with established guidelines for HRV data collection and analysis [
56]. Following this baseline period, participants entered the main testing phase, which comprised a succession of interactive VR environments. Four active stressor scenes, each lasting two minutes, were designed to elicit specific demands: one primarily cognitive load, another emotional stress, a third physical exertion, and a fourth combining physical and cognitive demands, representing a dual task that mirrors real-world challenges. These environments were, respectively, the Mole, City, Stroll, and Stroll CPT exercises. To facilitate physiological and psychological recovery between active stress exposures, each stressor scene was followed by a two-minute presentation of a passive and tranquil VR environment depicting a natural setting devoid of targeted stimuli. This recovery condition was the Tent nature scene. Participants remained standing for all tasks except during the physical and dual cognitive-physical conditions, where they were instructed to run on a VR treadmill, as illustrated in
Figure 3. All participants experienced each of the active stressor environments; however, the order of presentation was pseudo-randomized across individuals to control for potential sequence and carryover effects. This structured approach enabled the capture of both psychophysiological signals and facial expressions in a dynamic yet regulated setting, facilitating a comprehensive analysis of how various stressors uniquely affect emotional responses.
Participant demographic data and three standardized self-report questionnaires were administered to explore the effects of potential covariates and confounding factors. These were the Simulator Sickness Questionnaire (SSQ) [
57], the short version of the Game User Experience Satisfaction Scale (GUESS-18) [
58] and the System Usability Scale (SUS) [
59]. Simulator sickness was evaluated at baseline and after VR exposure. Facial expressions data were continuously collected throughout the whole testing phase using fEMG. Participants provided subjective mood (5-point emoji-based Likert scale) and workload ratings for each stressor task during the corresponding recovery periods. The workload was captured using the NASA Task Load Index (NASA-TLX) [
60], a widely used tool for assessing perceived workload across various dimensions. While it is not specifically designed to capture emotions directly, it can provide insights into the emotional states of participants by measuring the mental, physical, and temporal demands of a task, as well as perceived performance, effort, and frustration levels.
Figure 4 illustrates the distribution of participants’ self-assessed emotional responses, with the two types of self-report measures. After completing the session, participants reviewed a screen recording of their virtual task performance and provided qualitative feedback. They also completed the series of questionnaires to assess perceptions related to user experience and satisfaction, and system usability.
All data collected in this study were ingested and processed using a Python-based framework (Python v3.10.12) developed by the NRC, within the PyCharm IDE (v2025.3.3). This framework streamlines the aggregation of time-resolved, multimodal data and facilitates efficient feature extraction over custom-defined time intervals, thereby enabling subsequent statistical and deep learning analyses. The data processing pipeline used to construct the customized DL model comprised three main stages: (1) data tagging, (2) normalization, and (3) sliding-window segmentation.
Tagging: The emteqPRO provides filtered EMG data, free of noise and irrelevant frequency components, from which the EMG amplitude is derived as a clean representation of muscle activity. The amplitude correlates with the degree of muscle activation; stronger muscle activation results in higher voltage amplitude. This measure provides direct insight into the power of fEMG activation at a given time for analyzing facial expressions and emotional responses. The data were tagged according to the experimental context:
Calibration data were labeled with the intentional facial expression elicited during each trial (smile, frown, eyebrow-raise, neutral).
Experimental, testing phase, recordings were tagged with scene-level metadata describing the emotional, cognitive, physical, and dual-demand VR stressors.
This tagging ensured that each segment of signal data was linked to the appropriate expression label or task condition for downstream modeling and statistical analysis.
Normalization: To account for participant-specific physiology and reduce inter-individual variability, fEMG must be normalized [
61]. Based on recommendations from earlier studies [
12], we applied two successive normalization procedures to ensure that the data are comparable across participants: neutral correction and Maximum Voluntary Isometric Contraction (MVIC) normalization. Neutral correction specifically targets baseline differences in resting tone across participants and ensures that the signals reflect task-related activation rather than individual baseline drift. It is important because it standardizes the starting point for each participant’s data, allowing for more accurate comparisons. MVIC normalization scales the muscle activation relative to individual maximum capacity, thereby reducing inter-individual variability in signal amplitude associated with differences in muscle strength and recruitment. This step is vital for ensuring that the fEMG amplitudes are comparable across individuals, as it accounts for physiological differences in muscle activation capabilities. Details on these two normalization procedures, including the specific steps, illustrations, and their effects, are provided as (
Appendix A.1).
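To make the two normalization steps concrete, the following minimal Python sketch applies neutral correction followed by MVIC scaling to a per-channel fEMG amplitude array; the array shapes, segment choices, and function name are illustrative assumptions, and the exact procedure follows Appendix A.1.

```python
# Minimal sketch of the two-step normalization (neutral correction + MVIC scaling).
# Shapes and segment choices are illustrative; the actual procedure is in Appendix A.1.
import numpy as np

def normalize_femg(signal, neutral_segment, mvic_segment):
    """Normalize a (samples x channels) fEMG amplitude array.

    signal          -- task or calibration recording to normalize
    neutral_segment -- resting/neutral recording used for baseline (neutral) correction
    mvic_segment    -- maximum-expression recording used as the MVIC reference
    """
    baseline = neutral_segment.mean(axis=0)        # per-channel resting tone
    corrected = signal - baseline                  # neutral correction
    mvic = (mvic_segment - baseline).max(axis=0)   # per-channel maximum voluntary activation
    mvic = np.where(mvic > 0, mvic, 1.0)           # guard against degenerate channels
    return corrected / mvic                        # activation as a fraction of MVIC

# Example with synthetic data: 7 channels sampled at 1000 Hz
rng = np.random.default_rng(0)
neutral = rng.normal(5.0, 0.5, size=(5000, 7))
mvic_trial = rng.normal(50.0, 5.0, size=(3000, 7))
task = rng.normal(20.0, 2.0, size=(120000, 7))
task_norm = normalize_femg(task, neutral, mvic_trial)
print(task_norm.shape)
```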
Sliding-Window Segmentation: Sliding-window segmentation was used as a data-augmentation method by dividing each participant’s continuous normalized fEMG signals into fixed-length samples for model training, thereby increasing the number of training instances per expression. Each signal was labeled according to the elicited facial expression. Signals were divided into windows of length w with a stride of s; only complete windows containing the full number of samples were retained. In this work, the window length was determined empirically through pilot testing to balance signal stability and temporal resolution, and a non-overlapping stride (s = w) was used to minimize redundancy and reduce overfitting from highly correlated samples.
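A minimal sketch of this segmentation step is shown below; the window length and stride values are placeholders, since the actual window length was chosen empirically and the stride was set equal to the window length (zero overlap).

```python
# Illustrative windowing sketch; w and s are placeholders (the study used s = w).
import numpy as np

def segment_windows(signal, label, w, s):
    """Split a (samples x channels) array into fixed-length windows.

    Only complete windows are kept; each window inherits the trial's expression label.
    Returns (windows, labels) with windows shaped (n_windows, w, channels).
    """
    windows, labels = [], []
    for start in range(0, signal.shape[0] - w + 1, s):
        windows.append(signal[start:start + w])
        labels.append(label)
    return np.stack(windows), np.array(labels)

# Example: a 2-minute recording at 1000 Hz, 7 channels, 1-second non-overlapping windows
x = np.random.randn(120000, 7)
X, y = segment_windows(x, label=0, w=1000, s=1000)   # s = w gives zero overlap
print(X.shape)  # (120, 1000, 7)
```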
Figure 5 presents an overview of the expression modeling pipeline, which integrates calibration-based normalization, model training, expression-feature extraction, and mixed-effects statistical analysis. In the calibration phase (Step 1 in
Figure 5), fEMG signals are normalized using participant-specific MVIC values computed from calibration recordings. All preprocessing and MVIC normalization were performed independently for each participant prior to cross-validation. Because MVIC normalization relies solely on each participant’s own calibration data and does not involve shared statistics or learnable parameters, it does not introduce data leakage. After normalization, calibration data from all participants are pooled to train a single shared CNN–TCN model that learns population-level patterns of facial muscle activation associated with different expressions. Model performance was evaluated using leave-one-participant-out (LOPO) cross-validation, ensuring that calibration data from the held-out participant were excluded from training in each fold. During the inference phase (Step 2), the finalized model, trained on the full pooled calibration dataset after cross-validation, is applied to normalized, unlabeled experimental fEMG recordings collected during the VR stressor scenes to generate continuous time-series of predicted facial expressions. These expression sequences are then transformed into a set of statistical expression-based features for each participant and scene (Step 3). Finally, these features are combined with NASA-TLX self-reported workload scores in a mixed-effects modeling framework to quantify how predicted expression dynamics relate to subjective workload across the emotional, cognitive, physical, and dual-demand VR scenes (Step 4). In the following, each component of the pipeline is described in detail.
Step 1 involves training the model using the calibration-phase fEMG signals captured from seven channels, including the left and right orbicularis, left and right zygomaticus, left and right frontalis, and a single corrugator channel. During this phase, participants are explicitly instructed to perform a specific facial expression (smiling, maintaining a neutral face, frowning, or raising eyebrows), which is used to label the data. The model is trained on the calibration dataset to learn the temporal and spatial representations of the four intentional facial expressions. Training on the calibration data substantially enhances the model’s ability to interpret and classify facial expressions from muscle activity. For training, we employed the LOPO strategy, where each participant’s data is used as a test set once, while data from all remaining participants are used for model development. Within each fold, the training portion was further split into training and validation subsets to enable hyperparameter tuning and early stopping. This process is repeated for each participant, ensuring that the model is evaluated on all participants’ data. Following this generalizability analysis, the model was retrained using the full calibration dataset, and the resulting model was used for all subsequent analyses in the second phase of the study. As illustrated in
Figure 6, the CNN-TCN model consists of three main components: CNN Block, TCN Block, and Fully Connected (FC) Block, with their configurations and hyperparameters provided as (
Appendix A.2).
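A minimal PyTorch sketch of this three-block design is given below; the layer counts, kernel sizes, dilation factors, and dropout rate are illustrative placeholders, as the actual configuration and hyperparameters are provided in Appendix A.2.

```python
# Minimal sketch of a CNN-TCN classifier for windowed 7-channel fEMG input.
# All layer sizes and dilations are placeholders, not the study's configuration.
import torch
import torch.nn as nn

class CNNTCN(nn.Module):
    def __init__(self, n_channels=7, n_classes=4):
        super().__init__()
        # CNN block: learns spatial / inter-channel activation patterns
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # TCN block: dilated convolutions capture longer-range temporal dependencies
        self.tcn = nn.Sequential(
            nn.Conv1d(64, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        # FC block: maps pooled temporal features to the four expression classes
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):            # x: (batch, time, channels)
        x = x.transpose(1, 2)        # -> (batch, channels, time) for Conv1d
        return self.fc(self.tcn(self.cnn(x)))

model = CNNTCN()
logits = model(torch.randn(8, 1000, 7))   # 8 windows of 1 s at 1000 Hz
print(logits.shape)                        # torch.Size([8, 4])
```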
In Step 2, the inference phase, the CNN–TCN model trained on the normalized calibration fEMG signals is applied to the full set of unlabeled normalized experimental fEMG recordings collected during the VR stressor sessions. Each participant’s continuous fEMG signals are segmented into windows using the same windowing procedure as in training, and the trained model produces a class prediction for each window corresponding to the most likely facial expression. These class predictions are then assembled into a continuous time-series sequence of expressions for each VR scene. This inference process allows us to estimate spontaneous, natural facial expressions as they unfolded throughout the emotional, cognitive, physical, and dual-demand VR conditions, without requiring any manual labels during the experiment. The resulting expression time-series served as the input for the subsequent feature extraction stage, enabling a detailed analysis of expression dynamics across different stressor conditions. It is worth mentioning that although participant-specific calibration recordings are used, they serve exclusively for MVIC-based normalization of facial EMG signals rather than participant-specific model training. After normalization, a single shared CNN–TCN model is trained on pooled data from all participants. The model therefore operates as a scene-agnostic, cross-validated framework within the current sample, but not as a calibration-free system.
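As an illustration of this inference step, the sketch below (assuming a trained model such as the one sketched earlier and a normalized, unlabeled scene recording) produces one predicted expression label per window; the class ordering and window parameters are assumptions, not the study's settings.

```python
# Sketch of the inference step: window an unlabeled recording and predict each window.
import numpy as np
import torch

EXPRESSIONS = ["neutral", "smile", "frown", "raised_eyebrow"]  # assumed class order

def predict_expression_series(model, scene_signal, w=1000, s=1000):
    """Return one predicted expression label per window of an unlabeled recording."""
    model.eval()
    starts = range(0, scene_signal.shape[0] - w + 1, s)
    windows = np.stack([scene_signal[i:i + w] for i in starts])  # (n_windows, w, channels)
    with torch.no_grad():
        preds = model(torch.from_numpy(windows).float()).argmax(dim=1).numpy()
    return [EXPRESSIONS[p] for p in preds]

# Example (using the CNNTCN sketch above and a normalized scene recording):
# series = predict_expression_series(model, normalized_scene_recording)
# -> ['neutral', 'neutral', 'smile', ...], one label per one-second window
```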
In Step 3, after generating continuous time-series of predicted expressions in the inference phase, the next step involves deriving statistical features that summarize meaningful aspects of expression behavior. Depending on the application, the pipeline could end at expression prediction, for real-time expression recognition or monitoring. However, in this study, we aim to go beyond expression classification and investigate whether facial expression dynamics could reveal more nuanced patterns related to cognitive, emotional, physical, and mental workload. To do so, a comprehensive set of expression-based features is computed from the predicted facial expression time-series. Feature extraction is performed separately for each participant and each VR scene. In total, the continuous fEMG recordings were segmented into fixed temporal windows for model training, yielding a substantial number of spatio-temporal samples per participant. This windowing strategy increased temporal data density and supported stable optimization of the CNN–TCN model within the participant-independent evaluation framework. This extensive data helps mitigate the limitations typically associated with a small sample size and enhances statistical power. The features captured multiple aspects of temporal expression behavior and are grouped into four conceptual categories. The comprehensive list of features and details on the categories is provided as (
Appendix A.3). The resulting expression features served as inputs to the mixed-effects modeling stage, where their relationship to self-reported workload measured by NASA-TLX is examined.
In Step 4, to examine whether facial expression dynamics reflect the demands elicited by each VR stressor condition, a regression analysis is performed using the extracted expression features together with self-reported workload scores from the NASA-TLX. A mixed-effects model is chosen because it accounts for repeated measures within participants as a random effect and enables estimation of scene-specific effects while controlling for individual variability. For each expression feature, its relationship with the NASA-TLX subscales is modeled to determine whether specific aspects of facial expression behavior correspond to participants’ subjective experiences of mental, emotional, physical, or temporal workload. Since there are 8 NASA-TLX subscale scores, we performed corrections for multiple comparisons using the False Discovery Rate (FDR) [
62]. This step aims to determine whether the predicted expression patterns meaningfully reflect the stressor by identifying expression features most strongly associated with each scene demand. In other words, given a VR scene designed to elicit cognitive effort, emotional frustration, physical exertion, or combined dual demand, we test whether the expression features respond in ways that aligned with the intended manipulation and are consistent with established FACS interpretations. Scene-specific associations (e.g., increased frown-related dynamics during cognitive tasks or increased lower-face activation during physical tasks) indicate that the model captures physiologically plausible responses tied to the nature of the workload.
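As a small illustration of the FDR step, the sketch below applies the Benjamini–Hochberg procedure with statsmodels to a set of placeholder p-values (one per tested NASA-TLX subscale score); the values are illustrative, not study results.

```python
# Benjamini-Hochberg FDR correction over the per-subscale p-values (placeholder values).
from statsmodels.stats.multitest import multipletests

raw_p = [0.004, 0.020, 0.180, 0.650, 0.030, 0.410, 0.090, 0.012]  # one per subscale score

rejected, p_fdr, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for p, padj, keep in zip(raw_p, p_fdr, rejected):
    print(f"p = {p:.3f} -> p_FDR = {padj:.3f}, significant = {keep}")
```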
3. Results
Before proceeding with the model-based analyses, we conducted a series of verification analyses to ensure that (i) the VR scenes successfully elicited the intended emotional and workload states and (ii) the recorded fEMG signals exhibited physiologically coherent and reliable activation patterns. These analyses included statistical evaluation of self-reported emotional ratings, NASA-TLX workload distributions across scenes, bilateral fEMG reliability testing, validation of expression-specific activation signatures during calibration, and examination of spontaneous scene-dependent fEMG modulation during the experimental phase. To maintain focus on the primary modeling outcomes in the main Results section, detailed statistical procedures, figures, and tables supporting these validation analyses are provided as (
Appendix A.4,
Appendix A.5 and
Appendix A.6). Briefly, the self-reported mood ratings via the emoji-based Likert scale showed significant variability across scenes, with a Friedman test revealing a significant main effect of scene, χ²(4) = 10.93, p = 0.027, and a Kendall’s W = 0.23 indicating a small-to-moderate effect size. Post hoc comparisons confirmed significant differentiation between the emotional and cognitive demanding scenes (adjusted p = 0.011), validating the sensitivity of the scale to scene-induced emotional experiences. NASA–TLX subscale scores were examined to evaluate whether the four VR scenes elicited distinct workload profiles. Friedman tests indicated significant scene effects for mental, physical, temporal demand, performance, and effort, all with p-values < 0.05, but not frustration, confirming that most of the VR exercises elicited distinct workload profiles consistent with their intended task demands. Post hoc Wilcoxon tests showed that mental demand was higher in the dual-task condition than in the purely physical condition, and that the emotional demand scene was rated as less physically demanding than both the cognitive and dual tasks, aligning with and validating the intended experimental manipulations. The continuous sensor-based measures were also verified. Spontaneous fEMG dynamics revealed distinct scene-specific muscle activation patterns, with significant increases in frontalis activity during the emotionally demanding scene (p(FDR) = 0.05), and pronounced zygomaticus and orbicularis activation during physical (p(FDR) = 0.05 and 0.04, respectively) and dual-task (p(FDR) = 0.006 and 0.01, respectively) conditions, showing moderate to large effects. Empirically, muscle activations were not continuously sustained but instead appeared as transient peaks that rose and returned to baseline over time. This behavior highlights the inherently dynamic nature of facial muscle responses and underscores the need for temporal modeling approaches such as CNN–TCN.
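The sketch below illustrates, with synthetic ratings, how the scene-effect and post hoc tests summarized above can be run in Python using scipy (Friedman test, a Kendall's W effect size derived from the Friedman statistic, and a Wilcoxon signed-rank comparison); the rating matrix and scene pairing are placeholders, not the study's data.

```python
# Illustrative scene-effect tests on a synthetic (participants x scenes) rating matrix.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(12, 5))            # 12 participants x 5 scenes (placeholder)

# Friedman test across scenes (repeated measures, ordinal ratings)
stat, p = friedmanchisquare(*[ratings[:, j] for j in range(ratings.shape[1])])
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3f}")

# Kendall's W effect size derived from the Friedman statistic
n, k = ratings.shape
kendalls_w = stat / (n * (k - 1))
print(f"Kendall's W = {kendalls_w:.2f}")

# Example post hoc pairwise comparison between two scenes
# (in the study, post hoc p-values were corrected for multiple comparisons)
stat_w, p_w = wilcoxon(ratings[:, 0], ratings[:, 1])
print(f"Wilcoxon: statistic = {stat_w:.1f}, p = {p_w:.3f}")
```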
3.1. Demographic and Questionnaire Data
The study featured a balanced gender distribution, with an equal number of female and male participants (6 each), which helps mitigate gender-related biases. However, the average age for male participants was slightly higher (M = 42.5) compared to female participants (M = 38.7). Age variations within this small sample could introduce variability in responses, potentially influenced by factors such as familiarity with technology or susceptibility to simulator sickness.
Participants reported limited routine gaming activity, with an average of 4.0 ± 7.6 h per week (range = 0–20). The most common level of VR experience was “Fair” (“I know of the types of VR—immersive, non-immersive, mixed reality, etc.”), reported by 5 of 12 participants (41.7%), followed by “Very good” (“I own a virtual reality headset”) and “Average” (“I’ve tried VR”), each reported by 3 participants (25%). Most participants used VR less than once a month (7 of 12; 58.3%), and the majority (11 of 12; 91.7%) described themselves as extremely comfortable with technology. Despite the variability in VR experience, it is noteworthy that none of the participants reported having “No experience with VR”. Coupled with the high level of technological comfort reported by the majority, we do not anticipate a significant impact on emotional responses solely due to variations in VR experience.
The SSQ assessed symptoms of cybersickness across three primary categories: nausea, oculomotor disturbances, and disorientation. Severity was determined by total SSQ scores: 0–5 (negligible), 5–10 (mild), 10–15 (moderate), and 15+ (severe) [
63]. The mean post-session SSQ score was 5.91 ± 4.4, with most participants reporting negligible to mild symptoms, as scores fell within the range of 1 to 8. Two participants demonstrated moderate post-session symptoms, with scores of 11 and 14, but they also reported symptoms pre-session, suggesting that these were not solely due to VR exposure. Overall, cybersickness did not appear to be a significant confounding factor in the study.
The SUS was used to assess whether there were any usability issues, such as difficulties navigating the interface, that might influence participant performance or emotional responses. The SUS yielded an overall mean score of 73.2 ± 18.8, indicating above-average usability, with scores above 68 generally reflective of good usability. This suggests that the observed effects are unlikely to be due to usability challenges, but rather primarily attributable to the experimental manipulations.
The GUESS-18 was used to examine how different aspects of the gaming experience might affect participant engagement or emotional responses. Excluding “social connectivity” due to the lack of participant interaction during VR exercises, the maximum score was adjusted to 56. Participants reported an overall positive experience, with scores ranging from 34 to 46, and a mean score of 38.5 ± 4.4. Good satisfaction across GUESS-18 dimensions suggests that observed effects are more likely due to experimental manipulations rather than dissatisfaction with the experience.
3.2. Calibration-Based Learning Phase Performance
Overall, the model demonstrated strong performance across training, validation, and held-out test participants. Because a large proportion of expression segments were neutral, model performance was quantified using both weighted and macro-averaged evaluation metrics to ensure that minority expression classes were adequately reflected in the overall assessment. The average classification metrics across all LOPO folds are summarized in
Table 2, including accuracy, macro/weighted F1-scores, macro/weighted precision and recall, and Receiver Operating Characteristic Area-Under-the-Curve (ROC-AUC). Test performance remained robust (accuracy ≈ 0.93 ± 0.07; macro-F1 ≈ 0.88 ± 0.13), indicating effective generalization to unseen participants. Training dynamics are illustrated in
Figure 7, which shows the evolution of the training and validation loss (left panel) and the weighted F1-score (right panel) over epochs. The rapid decrease in loss and corresponding increase in F1-score during the first epochs, followed by stabilization, confirms effective optimization and the absence of overfitting. These results indicate that the CNN–TCN architecture is capable of reliably extracting participant-specific facial expression representations from multichannel fEMG signals.
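For reference, the snippet below shows how the weighted and macro-averaged metrics reported above differ on an imbalanced toy example (macro averaging weights each expression class equally, whereas weighted averaging reflects class frequencies); the label arrays are illustrative placeholders only.

```python
# Toy comparison of macro- vs weighted-averaged metrics under class imbalance.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = ["neutral"] * 4 + ["smile"] * 2 + ["frown"] * 2 + ["raised_eyebrow"] * 2
y_pred = ["neutral", "neutral", "neutral", "smile",
          "smile", "smile", "frown", "neutral",
          "raised_eyebrow", "frown"]

print("accuracy    :", accuracy_score(y_true, y_pred))
print("macro F1    :", f1_score(y_true, y_pred, average="macro"))
print("weighted F1 :", f1_score(y_true, y_pred, average="weighted"))
print("macro recall:", recall_score(y_true, y_pred, average="macro"))
```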
As shown in
Figure 8 (left), the LOPO evaluation on the test dataset revealed that the model achieved high classification reliability across all four expressions (smile = 0.94, frown = 0.81, raised eyebrow = 0.83, neutral = 1.00). Smiles and neutral faces were identified with very high sensitivity, with only a small fraction of smile trials (6%) being confused with neutral and no other cross-confusions. Raised eyebrow expressions were also classified robustly, with the main source of error being confusion with frown (17%), which is consistent with the partially overlapping activation patterns of these expressions in the corrugator and frontalis channels. Frown showed the lowest recall and was misclassified as neutral (11%) or raised eyebrow (8%), but it nonetheless remained clearly separable from the other categories, indicating that the model still captured the characteristic brow-lowering activation pattern despite inter-individual variability. The ROC curves (
Figure 8, right panel) further highlight the classifier’s discriminative strength. All four expressions achieved AUC values above 0.84, and the curves for smile, raised eyebrow, and neutral approached ceiling performance. The consistently high true-positive rates across a wide range of false-positive thresholds indicate robust separability of the underlying fEMG patterns.
After LOPO cross-validation, the CNN–TCN classifier was retrained on the full calibration dataset to produce the final model used for inference. This model demonstrated excellent performance across all four facial expressions, with weighted precision, recall, and F1-scores all exceeding 0.98. Per-class metrics were likewise high: smile (1.00/0.97/0.98), raised eyebrow (1.00/1.00/1.00), neutral (0.96/1.00/0.98), and frown (1.00/0.89/0.94) for precision/recall/F1-score, respectively. Most misclassifications reflected mild confusion with the neutral class, with 3% of smile trials and 11% of frown trials predicted as neutral, whereas raised eyebrow and neutral expressions were never confused with other categories. The overall accuracy of 0.98 further confirms the stability and generalizability of the model. These results indicate that the CNN–TCN architecture learned highly discriminative, physiologically meaningful representations of facial muscle activation, providing a reliable foundation for subsequent extraction of expression features during the VR experimental phase and their analysis via mixed-effects modeling.
3.3. Inference-Phase Expression Patterns Across Experimental Scenes
Since the experimental recordings did not contain ground-truth expression labels, inference-phase performance was assessed by characterizing scene-level expression patterns, evaluating physiological plausibility, and examining consistency with native Emteq expression indices. The model’s output, a continuous time-series of predicted expression classes, was summarized at the participant level by computing the proportion of non-neutral expressions in each scene and quantifying each participant’s change relative to their own baseline. As shown in
Figure 9, baseline-corrected non-neutral expressiveness clustered around zero during the emotional demand scene, was reduced during the cognitive demand scene, and increased markedly during the physical and dual demand scenes, with the largest median change and greatest inter-individual variability observed in the dual demand condition. This suppression of expressive activity under cognitive load, contrasted with the heightened expressiveness under physically and doubly demanding scenarios, is consistent with expected patterns of facial stillness during focused concentration and greater expressive engagement when tasks are more embodied or multimodal. Overall, these patterns indicate that the CNN–TCN model captures meaningful, scene-dependent variations in facial expressiveness even in the absence of labeled data.
To determine which expression categories drove these changes, for each trial and scene we computed the proportion of time classified as smile, frown, raised eyebrow, or neutral and subtracted that trial’s baseline proportions; the scene-averaged differences (Δ proportion vs. baseline) are shown in
Figure 10. Relative to baseline, the physical demand scene exhibited the largest reduction in neutral predictions (−0.17) together with a pronounced increase in smile (+0.14) and a smaller increase in raised eyebrow (+0.03), indicating that enhanced expressiveness in this condition is primarily driven by positive-valence mouth and brow activity. The dual demand scene showed a similar pattern, with a marked decrease in neutral (−0.14) and a corresponding increase in smile (+0.14), while emotional demand produced only subtle shifts (slight decrease in neutral and small increase in raised eyebrow, all |Δ| ≤ 0.04). In contrast, the cognitive demand scene displayed a modest increase in neutral (Δ = +0.05) accompanied by small decreases in smile and raised eyebrow, reinforcing the interpretation that participants maintained a relatively neutral, still facial posture during cognitively demanding tasks. Across all scenes, frown probabilities remained close to zero, suggesting that negative-valence brow contractions were rare in this protocol.
To evaluate the physiological plausibility of the model’s predictions during the inference phase, the scene-level distribution of predicted expressions was directly compared against Emteq’s native expression indices.
Figure 11 presents radar plots of expression proportions for each VR scene, showing side-by-side the CNN–TCN model outputs (blue) and the corresponding Emteq-derived expression estimates (red). Across all scenes, the model closely reproduced the dominant expressive pattern observed in the Emteq signal, namely, a strong predominance of neutral activity at baseline, modest increases in positive-valence expressions (smile and raised eyebrow) during the emotional and physical demand scenes, and minimal expressive output during the cognitive demand condition. The overlap of the blue and red traces indicates a high degree of correspondence, with the model tracking relative changes in scene-dependent expressiveness even though it was trained exclusively on calibration expressions and received no ground-truth labels during the experimental session.
3.4. Scene-Dependent Associations Between Facial Expression Dynamics and Workload
To characterize the overall structure of the expression-derived feature space, we performed principal component analysis (PCA) on the 81 standardized expression features (see
Table 3) and retained the first two components, which together explained 41.6% of the variance (PC1: 24.5%, PC2: 17.1%). Inspection of the PC1 loadings showed that this component was dominated by global expressiveness metrics, including entropy, switch rate, neutral and raised eyebrow burst counts, and transitions between neutral, smile, and raised eyebrow states (e.g., entropy, switch rate, neutral burst count, raised eyebrow burst count, neutral to smile). PC1 therefore captures a general “expression dynamics” axis, reflecting how frequently participants moved away from a stable neutral face into any non-neutral expression. PC2, in contrast, was defined primarily by frown-related features (e.g., count frown, frown burst mean/median/max, frown burst count, frown ratio) together with smaller contributions from surprised and eyebrow-burst variability, indicating that it indexes a tension/frustration dimension driven by corrugator-like activity rather than positive valence.
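A minimal sketch of this dimensionality-reduction step is shown below; the feature matrix is a random placeholder standing in for the participant-by-scene expression features, and the loading inspection mirrors the procedure described above.

```python
# PCA sketch over the standardized expression-feature matrix (placeholder data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
feature_matrix = rng.normal(size=(60, 81))            # rows = participant-scene observations

X = StandardScaler().fit_transform(feature_matrix)     # standardize each feature
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)                              # PC1/PC2 scores per observation
print("explained variance:", pca.explained_variance_ratio_)

# Inspect which features dominate each component via loading magnitudes
loadings = pca.components_                             # shape (2, 81)
top_pc1 = np.argsort(np.abs(loadings[0]))[::-1][:5]    # indices of top-5 PC1 features
print("top PC1 feature indices:", top_pc1)
```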
Scene-wise boxplots of PC1 and PC2 (
Figure 12) and mixed-effects models with participant as a random intercept confirmed that these components varied systematically across conditions. For PC1, only the cognitive demand scene differed significantly from baseline (β = −3.93, p = 0.005), showing markedly reduced expression dynamics, whereas emotional, physical, and dual demand scenes did not significantly depart from baseline, consistent with the impression that cognitive load suppressed overall facial expressiveness. For PC2, emotional demand elicited significantly higher scores than baseline (β = 2.34, p = 0.031), whereas physical and dual demand scenes showed significantly lower scores (β = −2.23, p = 0.040; β = −2.31, p = 0.033), indicating that the emotionally evocative scene was associated with more frown-related tension, while the physically and dual-demand scenes were characterized by reduced frown activity. We performed this PCA step to reduce the high-dimensional feature set to a small number of interpretable latent dimensions that summarize global expressiveness (PC1) and frown-related tension (PC2). This allowed us to demonstrate that the model-derived expression features carry coherent, scene-dependent structure before moving to more detailed, feature-level and NASA-TLX-based mixed-effects analyses.
Building on the PCA structure, we next examined how individual expression features relate to subjective workload across scenes. To do so, we fitted a series of linear mixed-effects models in which each z-scored expression feature predicted NASA-TLX ratings, with scene included as a fixed effect and an interaction term allowing feature–workload associations to vary by VR context. Participant-level random intercepts accounted for repeated measures and inter-individual differences in baseline workload. Models were estimated using restricted maximum likelihood with the Powell optimization algorithm, which yielded stable convergence across features. This modeling framework enabled us to quantify not only whether specific expressive behaviors were associated with subjective workload, but also whether the strength or direction of these associations was selectively modulated by task demands.
To facilitate interpretation, the extracted expression features were designed to capture both tonic and temporal characteristics of facial expression behavior within each scene. Specifically, the feature set includes summary descriptors of overall expression prevalence (e.g., ratios, counts, and burst-related measures) as well as sequence-level descriptors that reflect temporal organization (e.g., transition probabilities, latency indices, and entropy). All statistical inferences reported below are based on mixed-effects models with Benjamini–Hochberg FDR correction, and only associations surviving q < 0.05 are described (β coefficients and SE are reported for the fixed effects). Note that all reported relationships between facial expression dynamics and NASA-TLX dimensions are correlational in nature and should be interpreted as evidence of covariation and convergent validity rather than causal effects, owing to the non-causal nature of regression analysis. Moreover, all expression predictors were z-scored prior to modeling; therefore, the reported β coefficients represent the change in NASA-TLX score (0–10 scale) per one standard deviation increase in the expression feature, enabling standardized comparison of effect sizes across features within each NASA-TLX dimension.
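To make the modeling setup concrete, the sketch below fits one such feature-level model with statsmodels; the DataFrame contents are synthetic placeholders, and only the formula structure (z-scored feature by scene interaction, participant random intercept, REML with the Powell optimizer) mirrors the analysis described above.

```python
# Sketch of one feature-level mixed-effects model (synthetic placeholder data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_participants, scenes = 12, ["emotional", "cognitive", "physical", "dual"]
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), len(scenes)),
    "scene": scenes * n_participants,
    "feature_z": rng.normal(size=n_participants * len(scenes)),    # z-scored expression feature
    "tlx": rng.uniform(0, 10, size=n_participants * len(scenes)),  # NASA-TLX subscale score
})

# Feature-by-scene interaction with a random intercept per participant
model = smf.mixedlm("tlx ~ feature_z * C(scene)", data=df, groups=df["participant"])
result = model.fit(reml=True, method="powell")
print(result.summary())
```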
The CNN–TCN architecture served as the primary spatio-temporal modeling framework in this study, with subsequent benchmarking confirming its strong performance under LOPO cross-validation. The most consistent pattern—observed across both the CNN–TCN–derived features and Emteq-derived metrics—was a negative association between NASA-TLX physical demand and smile-related dynamics, indicating reduced smile persistence/expressivity under higher perceived physical workload. This convergent effect was observed across multiple dynamic and statistical smile-related features and across both modeling pipelines, suggesting a robust relationship between increased physical workload and reduced smile-related activity (broadly compatible with reduced FACS AU12-related engagement).
Additional FDR-surviving associations were observed between workload dimensions and eyebrow- and smile-related temporal dynamics in the emotional- and physical-demand scenes; however, these effects were less consistent across feature types and contexts. Overall, robust effects concentrated primarily on smile-related dynamics, with more selective modulation observed in eyebrow-related transitions. These effects may reflect modulation of expressive dynamics under affective challenge, though we interpret them cautiously given the exploratory, correlational nature of the analysis. Given the relatively small sample size and the number of tested associations across features and workload dimensions, all mixed-effects analyses reported here should be interpreted as exploratory and hypothesis-generating rather than confirmatory. Although FDR correction was applied to mitigate multiple-comparisons risk, these findings require replication in larger independent cohorts before firm conclusions can be drawn regarding the stability and generalizability of the observed relationships. Detailed feature-level statistics for all mixed-effects models are provided as (
Appendix A.7).
3.5. Model Comparison
To contextualize the performance and interpretability of the proposed CNN–TCN framework, we conducted a complementary comparison. We benchmarked the CNN–TCN model against alternative deep learning architectures (CNN, TCN, LSTM, CNN–LSTM, and GRU) that are commonly used for affective computing and physiological time-series analysis. While traditional fEMG studies frequently rely on handcrafted feature extraction followed by classical classifiers (e.g., SVM, LDA), the present framework operates directly on raw fEMG time-series and learns spatio-temporal representations end-to-end. Implementing a feature-based baseline would require constructing a separate feature-engineering pipeline distinct from the raw-input modeling strategy adopted here. Therefore, we evaluated representative spatial (CNN), temporal (TCN, LSTM, GRU), and hybrid (CNN–LSTM) architectures under identical LOPO conditions within a unified raw-input framework.
For this purpose, all neural architectures were evaluated exclusively under supervised conditions using participant-specific calibration data with ground-truth expression labels. As such, the deep learning comparison reflects relative model capacity under controlled supervision rather than end-to-end deployment performance. As summarized in
Table 4, our proposed CNN–TCN achieves the strongest overall performance in terms of both Macro F1-score (0.882 ± 0.135) and Macro Recall (0.896 ± 0.107), indicating the most balanced sensitivity across expression classes under class imbalance. While the CNN baseline performs competitively, reflecting its effectiveness in capturing localized muscular activation patterns, its lower Macro F1 compared to CNN–TCN suggests limited access to longer-range temporal structure. In contrast, the TCN and recurrent models (LSTM, CNN–LSTM, GRU), which primarily emphasize temporal dependencies, exhibit reduced performance, likely due to weaker spatial feature extraction from fEMG signals. These results indicate that jointly modeling spatial muscle activations and long-range temporal dynamics, as realized by CNN–TCN, provides the most effective representation for calibrated facial expression learning in this dataset. Moreover, a closely related study using the same emteqPRO facial mask [
64] reported an F1-Macro score of 0.86 for five posed expressions (smile, frown, raised eyebrow, squeezed eyes, neutral) using handcrafted feature extraction. In contrast, our proposed CNN–TCN model achieved a slightly higher Macro F1-score of 0.882 ± 0.135 under LOPO cross-validation, while operating directly on raw fEMG time-series without manual feature engineering. Although the prior study included an additional “squeezed eyes” class, which increases classification complexity due to muscle overlap with frown, the comparable performance achieved by the end-to-end CNN–TCN model suggests that learned spatio-temporal representations can match or exceed feature-based pipelines on the same sensing hardware.
4. Discussion
The results confirm that the experimental protocol successfully elicited distinct emotional and workload states across VR scenes, as supported by both self-reports and continuous facial expression data. Emoji-based mood ratings and NASA–TLX scores showed significant scene-dependent variation, indicating effective manipulation of emotional and cognitive demands. In parallel, spontaneous fEMG activity exhibited physiologically coherent, muscle-specific modulation across scenes, supporting the validity of linking facial muscle dynamics with task context. Together, these findings demonstrate the feasibility of combining immersive VR with wearable fEMG to investigate cognitive–emotional dynamics in ecologically valid conditions.
Although self-report measures captured broad differences between scenes, they provided limited resolution for complex affective states influenced by stress, effort, and engagement. The convergence between workload ratings and fEMG-derived expression dynamics highlights the value of multimodal approaches for capturing subtle cognitive–emotional variations. Incorporating more granular affective scales and additional physiological modalities in future work may further improve sensitivity to nuanced emotional states.
These findings build on prior work demonstrating the feasibility of integrating facial EMG into immersive VR systems and using it for affect and expression monitoring [
52,
53,
55,
65,
66,
67]. While most previous fEMG–VR research has focused on discrete emotions or short events, relatively few studies have examined continuous expressive dynamics during cognitively demanding VR scenarios. The present work addresses this gap by characterizing temporally evolving expression patterns associated with workload and emotional context in immersive environments.
The proposed CNN–TCN framework achieved robust expression classification and generalized across participants and VR scenes, supporting the value of spatio-temporal deep learning for modeling coordinated facial muscle activity. This approach, which relies on a calibration session in which participants were asked to produce intentional facial expressions, proved very promising despite being trained on only 12 participants: the CNN–TCN model detected facial expressions with high classification reliability. This finding is particularly noteworthy when comparing our results with Emteq’s native expression indices. Compared with traditional handcrafted feature pipelines and classical classifiers [
65,
68,
69], deep-learning approaches better capture complex spatial and temporal dependencies in multi-channel fEMG [
36,
66,
70,
71,
72]. We further evaluated transfer by applying the trained model to experimental fEMG recordings containing spontaneous, VR task-elicited expressions, and validated the dynamics through physiological plausibility and convergent associations with NASA-TLX workload measures. Scene-independent inference revealed physiologically meaningful variations in expression dynamics, and mixed-effects analyses identified statistically robust associations between expression-derived features and subjective workload, with FDR-surviving effects primarily related to perceived physical demand. These findings support the potential of deep-learning–based fEMG analysis as a continuous and objective complement to intermittent self-report measures in immersive VR.
While the results are promising, several limitations should be acknowledged. This initial pilot study is limited by its small sample size and the specificity of the VR tasks, which constrain the extent to which findings can be generalized beyond the current cohort. Although LOPO cross-validation, high-resolution windowed fEMG data, and regularized spatio-temporal modeling were employed to mitigate overfitting, the results should be interpreted as cross-validated within this sample rather than evidence of broad population-level generalization. External validation on larger and more diverse cohorts, including variation in age, ethnicity, and facial morphology, will be necessary before supporting wider deployment.
The facial expressions used in this study were limited to four expressions (smile, frown, raised eyebrow, and neutral), dictated by the standardized Emteq calibration protocol [
52], and employed as expression-level markers rather than direct representations of complex emotional states. Accordingly, the present framework should be interpreted as a proof-of-concept for mapping wearable fEMG-derived expression dynamics to cognitive workload, rather than as a comprehensive emotion recognition system. While we have implemented controlled conditions and advanced signal processing techniques to mitigate non-emotional influences, we acknowledge that factors such as physical exertion, breathing patterns, and mechanical artifacts due to movement within the immersive environment may still affect fEMG signals, particularly during physical and dual-task VR scenes. Future extensions incorporating richer expression vocabularies and multimodal signals may enable more direct modeling of higher-level affective constructs. Such extensions could include FACS-informed labeling of additional facial action units, expanded calibration tasks designed to elicit a broader range of expressions, and transfer learning approaches leveraging larger facial expression datasets to improve robustness and generalizability across users and contexts. In addition, incorporating more granular emotion scales, such as the Self-Assessment Manikin (SAM) [
73] or the Positive and Negative Affect Schedule (PANAS) [
74], could enhance our ability to capture the subtle complexities of emotional states, providing a more comprehensive understanding of the interplay between emotional experiences and experimental conditions. Furthermore, a formal a priori power analysis was not conducted, given the exploratory nature of the study and the lack of established effect size estimates for CNN–TCN–based fEMG modeling in immersive VR.
Future research will focus on expanding participant diversity, evaluating generalization across broader VR contexts, and transitioning toward real-time deployment. Integrating additional physiological signals and optimizing streaming inference will be essential for scalable implementation. Potential applications include adaptive training environments, therapeutic VR, and user-experience assessment, provided that appropriate safeguards for biometric data privacy and ethical use are maintained.