In this section, we describe the VREED dataset, which is categorized into four distinct classes. The preprocessing stage involved adapting the data for binary classification and addressing missing values and class imbalance to maintain the dataset’s integrity. For the machine learning analysis, we chose the extra trees classifier due to its high predictive performance. Initially, SHAP and LIME were applied to the full feature set to interpret the impact of individual features on the model’s predictions. Subsequently, principal component analysis (PCA) was employed to reduce the number of features by grouping them into blink, saccade, fixation, and micro-saccade components. SHAP and LIME were then reapplied to the reduced dataset. Each of these steps is detailed below.
3.1. Data Description
The present study utilized the publicly available VREED (VR Eyes: Emotions dataset), a multimodal affective dataset where emotions were elicited using immersive 360-degree video-based virtual environments (360-VEs) collected through VR headsets [14]. VREED is particularly useful for addressing our research question due to several key factors. Firstly, it is one of the pioneering multimodal VR datasets specifically designed for emotion recognition, incorporating both behavioral and physiological signals, including eye-tracking, ECG, and GSR data. This multimodal approach provides a comprehensive view of emotional responses, making it well-suited for our goal of analyzing visual attention and gaze patterns in VR environments. Secondly, the dataset was meticulously curated, with environments selected based on feedback from focus groups and pilot trials, ensuring that the stimuli used effectively elicit the intended emotional responses. This rigorous selection process enhances the validity of the data, making it a reliable source for examining the correlation between eye-tracking metrics and emotional states. Additionally, VREED includes data from a diverse group of participants across various age ranges, which improves the generalizability of our findings. The balanced representation of different emotional quadrants within the CMA further strengthens the dataset’s relevance to our research.
While ECG and GSR data provide valuable physiological insights, in the present study, we chose to focus exclusively on eye-tracking metrics since our primary objective was to analyze visual attention and gaze patterns in VR environments. Eye-tracking data offers a unique and direct window into cognitive and emotional processes, providing rich insights into how individuals interact with virtual environments. Given that eye movements are closely linked to visual attention and can be precisely quantified, they serve as a reliable indicator of emotional states. Moreover, eye-tracking metrics such as saccades, fixations, and blinks offer specific and actionable data that can be directly correlated with emotional responses. This focus aligns with our goal of advancing the understanding of affective computing within VR by leveraging XAI techniques to interpret these specific metrics.
The eye-tracking, ECG, and GSR data were collected from 34 healthy participants, comprising 17 males and 17 females, aged between 18 and 61 years. All participants were required to sign a consent form and complete a pre-exposure questionnaire. They interacted with 12 distinct VEs selected based on a focus group and a pilot trial.
During the selection phase of the VEs, the focus group—comprising six experts in human-computer interaction and psychology—identified 126 potential VEs, which they manually sourced from the YouTube platform [14]. Adhering to specific exclusion criteria, the focus group narrowed these 126 environments down to 21 VEs. Subsequently, a pilot trial involving 12 volunteers (6 males and 6 females) aged between 19 and 33 years further refined these 21 environments using two established psychological measurement tools: the Self-Assessment Manikin (SAM) and the Visual Analog Scale (VAS) [48,49]. SAM uses cartoon-shaped manikins to visualize the arousal and valence dimensions of emotions, while VAS is a 0 to 100 linear scale for quantifying emotional states such as joy, happiness, calmness, relaxation, anger, disgust, fear, anxiousness, and sadness. Since the objective of the pilot trial was to select three 360-VEs for each quadrant of the CMA, ensuring a comprehensive representation of emotional states, a total of 12 VEs were chosen for the final experiment according to the SAM (arousal, valence) and VAS (joy, anger, calmness, sadness, disgust, relaxation, happiness, fear, anxiousness, and dizziness) ratings.
In the experiment phase, the eye-tracking, ECG, and GSR data were collected using the final 12 VEs, each trio representing one quadrant of the CMA and designed to evoke a range of emotional responses corresponding to the different quadrants. Each VE included diverse elements of sights, sounds, and activities to create immersive and engaging experiences. In certain environments, participants were exposed to stimuli such as monsters, zombies, or a possessed woman to induce emotions associated with high arousal and negative valence, such as fear, stress, or anger. To evoke emotions of high arousal and positive valence, like excitement, participants were immersed in scenarios where they walked on a tightrope or danced with exotic performers. For inducing low arousal and positive valence emotions, such as calmness and relaxation, participants encountered settings like a farm with bunnies, a forest with bird sounds, or various spas and tranquil locations around the world. Lastly, to elicit low arousal and negative valence emotions, such as depression and sadness, participants were immersed in environments depicting mourning scenes, refugee camps, or war zones. A total of 34 participants (17 males and 17 females) aged between 18 and 61 years were recruited for this phase through a survey. Nineteen of them reported having used VR before, and none reported feeling motion sick during or after exposure to VR. Before the experiment, all participants signed a consent form and filled out a pre-exposure questionnaire, including the SAM and VAS. They then engaged with the environments in a randomized order, during which eye-tracking, ECG, and GSR data were collected, and they repeated the SAM and VAS questionnaires after the VR exposure. Initially, the interactions resulted in 408 trials; however, due to concerns about data quality and technical issues, the final dataset was narrowed down to 312 trials involving 26 participants.
Subsequent to data collection, the raw eye-tracking data were processed using the GazeParser library in Python [50]. This processing extracted four main features: fixations, micro-saccades, saccades, and blinks. A fixation is defined as a temporary halt in eye movement; micro-saccades as minor, involuntary movements within a fixation; saccades as rapid movements between fixations; and blinks as periods when the eyes are closed. The researchers then computed sub-features for each main feature using statistical calculations such as the normalized count (NormCount), mean, standard deviation (SD), skewness (Skew), and maximum (Max). These eye-tracking features are detailed in Table 1.
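As an illustration of how such sub-features can be derived, the following sketch aggregates a hypothetical table of saccade events with pandas and SciPy; the column names, trial duration, and normalization rule are assumptions rather than the exact GazeParser output.

```python
import pandas as pd
from scipy.stats import skew

# Hypothetical per-trial saccade events; the column names are assumptions,
# not the exact fields produced by GazeParser.
saccades = pd.DataFrame({
    "duration_ms": [42, 37, 55, 48],
    "amplitude_deg": [3.1, 4.7, 2.4, 5.0],
})

TRIAL_LENGTH_S = 60.0  # assumed trial duration used to normalize event counts

def summarize(events: pd.DataFrame, prefix: str, trial_length: float) -> dict:
    """Compute count- and distribution-based sub-features for one event type."""
    feats = {f"{prefix}_NormCount": len(events) / trial_length}
    for col in events.columns:
        feats[f"{prefix}_{col}_Mean"] = events[col].mean()
        feats[f"{prefix}_{col}_SD"] = events[col].std()
        feats[f"{prefix}_{col}_Skew"] = skew(events[col])
        feats[f"{prefix}_{col}_Max"] = events[col].max()
    return feats

print(summarize(saccades, "Saccade", TRIAL_LENGTH_S))
```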
Additionally, the eye-tracking dataset includes a target column labeled ‘Quadrant Category’, which corresponds to a quadrant of the CMA for each trial.
Table 2 delineates the nominal categories associated with each CMA quadrant. In total, the dataset comprises 312 rows and 50 columns, capturing a comprehensive set of variables crucial for the subsequent analysis.
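As a minimal sketch, the resulting table could be loaded and inspected as follows; the filename vreed_eye_tracking.csv is hypothetical, while the expected shape and target column follow the description above.

```python
import pandas as pd

# Hypothetical filename; the public VREED release ships its files under other names.
df = pd.read_csv("vreed_eye_tracking.csv")

print(df.shape)                                # expected: (312, 50) -> 49 features + target
print(df["Quadrant Category"].value_counts())  # number of trials per CMA quadrant
```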
3.2. Data Preprocessing
In this section, we detail the essential preprocessing steps undertaken to prepare the eye-tracking data for classification, with the aim of isolating variables that are effective in predicting the elicited emotions and their respective CMA categories.
Throughout the preprocessing phase, we utilized Python and Jupyter Notebook for data manipulation and analysis. Initial checks for missing values revealed 96 cells labeled as ‘not a number’ (NaN). These were imputed with the mean values of their respective columns to maintain data integrity.
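A brief sketch of this imputation step, continuing from the loading sketch above; it counts the missing cells and replaces each NaN with the mean of its column.

```python
# Count missing cells (96 NaN values are reported in the text),
# then impute each NaN with the mean of its column.
print(int(df.isna().sum().sum()))

feature_cols = df.columns.drop("Quadrant Category")
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())
```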
Given our goal to assess each quadrant category individually, we reformatted the ‘Quadrant Category’ target column, which originally contained four distinct values, each corresponding to one class, into a binary (one-vs-rest) classification format for each category. This restructuring led to an imbalance in the dataset; for instance, quadrant category 0 contained only 78 trials, while the other categories comprised the remaining 234 of the 312 trials. Training models on such imbalanced data can introduce bias, as models tend to favor the majority class, thus misleadingly inflating accuracy metrics. To address this, we implemented the Synthetic Minority Over-sampling Technique (SMOTE) [51], which balanced the data by enhancing the representation of the minority class. After the application of SMOTE, a balanced dataset suitable for unbiased binary classification was achieved.
This preprocessing approach was consistently applied across the remaining three categories to ensure uniform data quality and reliability in subsequent analyses.
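The sketch below illustrates the one-vs-rest restructuring and SMOTE balancing for all four quadrant categories using imbalanced-learn; it continues from the imputation sketch and is a simplified rendering of the procedure, not the authors’ exact code.

```python
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["Quadrant Category"])
balanced_sets = {}

for quadrant in sorted(df["Quadrant Category"].unique()):
    # One-vs-rest target: 1 for trials in this quadrant, 0 otherwise
    # (e.g., quadrant category 0 has 78 positive vs. 234 negative trials).
    y_binary = (df["Quadrant Category"] == quadrant).astype(int)

    # SMOTE synthesizes minority-class samples until both classes are equal in size.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y_binary)
    balanced_sets[quadrant] = (X_res, y_res)
```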
3.3. Machine Learning Model Selection and Fitting
In this study, the extra trees classifier (ET) was selected as our predictive model due to its superior performance across various metrics, particularly the F1-score, which provides a more robust evaluation on imbalanced data. Utilizing the PyCaret library [52], we automated the evaluation of multiple machine learning models, thereby streamlining our workflow and objectively assessing model performance on key metrics such as accuracy, AUC, recall, precision, and F1-score. Among the 14 models evaluated—the extra trees classifier (ET), light gradient boosting (LightGBM), random forest classifier (RF), quadratic discriminant analysis (QDA), extreme gradient boosting (XGBoost), gradient boosting classifier (GBC), AdaBoost classifier (Ada), Ridge classifier (Ridge), linear discriminant analysis (LDA), K-neighbors classifier (KNN), decision tree classifier (DT), logistic regression (LR), naive Bayes (NB), and support vector machines (SVM) with a linear kernel—the extra trees classifier consistently outperformed the other models in terms of average metric values across each quadrant. The mean values of the calculated metrics across all four quadrants are shown in Table 3.
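A sketch of how such an automated comparison could look with PyCaret’s classification module; the session seed and the choice to sort by F1-score are assumptions consistent with the text.

```python
from pycaret.classification import setup, compare_models

# One balanced one-vs-rest dataset from the preprocessing sketch above.
X_res, y_res = balanced_sets[0]
data = X_res.copy()
data["target"] = y_res

# setup() builds the training pipeline; compare_models() cross-validates the
# candidate classifiers and ranks them by the chosen metric (F1, as in the text).
setup(data=data, target="target", session_id=42)
best_model = compare_models(sort="F1")
```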
It is critical to select a model that exhibits high predictive performance to ensure the reliability of the XAI techniques SHAP and LIME. The resulting explanation values are intended to accurately reflect the significance and contribution of each feature to the model’s predictions; choosing a poorly performing model could therefore result in misleading interpretations, as the SHAP values would describe an inaccurate representation of the model’s behavior.
After the initial preprocessing, the dataset was partitioned into training and testing sets in an 80:20 ratio, yielding 374 rows for training and 94 rows for testing. Subsequently, we applied the ET with its default parameters to each quadrant individually within our dataset. This step was crucial to evaluating the model’s performance across different segments of the data and ensuring robustness in its predictive capabilities.
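The per-quadrant fitting step could be sketched as follows with scikit-learn’s ExtraTreesClassifier at its default settings, continuing from the balanced datasets above; the random seed is an assumption.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

fitted_models = {}
for quadrant, (X_res, y_res) in balanced_sets.items():
    # 80:20 split of the 468 balanced trials -> 374 for training, 94 for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X_res, y_res, test_size=0.2, stratify=y_res, random_state=42
    )
    model = ExtraTreesClassifier(random_state=42)  # default hyperparameters
    model.fit(X_train, y_train)
    print(quadrant, "F1:", f1_score(y_test, model.predict(X_test)))
    fitted_models[quadrant] = (model, X_train, X_test)
```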
3.4. Application of SHAP for Model Interpretation
Upon establishing a robust machine learning model as the foundation for our analysis, we focused on integrating XAI techniques to enhance the interpretability of the model. Specifically, we utilized the Shapley Additive Explanations (SHAP) algorithm as a tool for assessing the model’s global explainability. SHAP provides a comprehensive framework for interpreting the predictions made by machine learning models [11]. This method assigns a SHAP value to each feature, quantifying its relative impact on the model’s decision-making process. These values facilitate both local and global explanations, offering insights into the influence of individual features on specific predictions and the model as a whole [53]. The mathematical formulation of SHAP values ensures that the contribution of each feature is accurately represented, thereby enabling a deeper understanding of the model’s internal mechanics.
The SHAP value of feature i for model f and instance x is computed as

$$\phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,\left(M - |z'| - 1\right)!}{M!}\,\bigl[f_x(z') - f_x(z' \setminus i)\bigr],$$

where x represents a certain sample requiring explanation, f denotes the model under consideration, i identifies the feature being assessed, and M signifies the total count of features. Additionally, z' ⊆ x' encompasses every conceivable variation or disturbance (coalition of present features) of x.
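To make the formula concrete, the toy sketch below enumerates all coalitions and applies the Shapley weighting to a small linear model; replacing absent features with fixed baseline values is a simplification of how SHAP marginalizes missing features.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline, M):
    """Exact Shapley values by enumerating every coalition of the M features.
    Features outside a coalition are replaced with fixed baseline values."""
    def f_masked(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(M)]
        return f(z)

    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        contrib = 0.0
        for size in range(M):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(M - size - 1) / factorial(M)
                contrib += weight * (f_masked(set(subset) | {i}) - f_masked(set(subset)))
        phi.append(contrib)
    return phi

# Toy linear model: each Shapley value reduces to w_j * (x_j - baseline_j).
w = [2.0, -1.0, 0.5]
f = lambda z: sum(wj * zj for wj, zj in zip(w, z))
print(shapley_values(f, x=[1.0, 3.0, 2.0], baseline=[0.0, 0.0, 0.0], M=3))  # [2.0, -3.0, 1.0]
```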
In this study, we leveraged the SHAP library [11] in Python to compute SHAP values for the features in our dataset. Central to the SHAP framework is the assignment of a SHAP value to each feature, quantifying its incremental impact on the model’s prediction for a given instance. Originally adapted from cooperative game theory, where Shapley values ensure a fair allocation of payoffs among players, these values are reinterpreted within the SHAP framework to distribute the model’s output among features based on their relative contributions [53]. This approach is particularly advantageous for complex models, where discerning the impact of individual features can be challenging.
Our analysis focused on evaluating the relative importance of features within the dataset across the individual CMA quadrants. Given the unique emotional profiles of each quadrant, specific eye-tracking metrics likely influence distinct emotional states. By applying the SHAP algorithm to our model, we aimed to uncover which eye-tracking features are most influential in evoking specific emotions, thereby providing insights into how human emotions can be effectively identified from eye-tracking data. Our computation of SHAP values for each feature allowed us to identify key features that significantly affect the model’s decision-making process in each quadrant. Notably, some of the features assessed either negatively impacted the model’s decisions or had no discernible effect. This observation suggests the potential for developing a similarly effective model with a reduced set of features.
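A sketch of how per-quadrant SHAP values and a global feature ranking could be obtained with the shap library’s TreeExplainer, continuing from the fitted models above; the handling of the positive-class output is noted because it varies slightly across shap versions.

```python
import numpy as np
import shap

model, X_train, X_test = fitted_models[0]  # quadrant category 0 as an example

# TreeExplainer computes SHAP values efficiently for tree ensembles such as ET.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# For binary classifiers the output layout differs slightly between shap versions;
# keep the values attributed to the positive class in either case.
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[..., 1]

# Global importance: mean absolute SHAP value per feature, highest first.
importance = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(X_test.columns, importance), key=lambda t: t[1], reverse=True)
print(ranking[:10])

# Beeswarm-style overview of feature impact for this quadrant.
shap.summary_plot(shap_values, X_test)
```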
3.6. Principal Component Analysis (PCA)
After applying SHAP and LIME to each of the 49 features of the dataset and analyzing their impact on the model, we employed principal component analysis (PCA) as a dimensionality reduction technique to validate our findings. PCA not only mitigates the curse of dimensionality but also unveils the underlying structure of the data by transforming the original features into a set of orthogonal components. Using PCA, we created new features by grouping the fixation-, blink-, saccade-, and micro-saccade-related sub-features. This resulted in 15 features: Number of Micro-Saccades, Number of Blinks, Number of Saccades, Number of Fixations, Blink Duration, Fixation Duration, Saccade Duration, Saccade Direction, Saccade Amplitude, Saccade Length, Micro-Saccade Amplitude, Micro-Saccade Direction, Micro-Saccade Peak Velocity, Micro-Saccade Vertical Amplitude, and Micro-Saccade Horizontal Amplitude.
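A sketch of this group-wise reduction, in which each new feature is taken as the first principal component of its group of related sub-features; the substring matching rules shown are assumptions, since the exact grouping depends on the column names in Table 1.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def first_component(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Collapse a group of related sub-features into their first principal component."""
    scaled = StandardScaler().fit_transform(frame[columns])
    component = PCA(n_components=1).fit_transform(scaled).ravel()
    return pd.Series(component, index=frame.index)

# Two example groups; the substring rules are assumptions about the column names
# in Table 1, and the remaining 13 groups would be defined analogously.
groups = {
    "Fixation Duration": [c for c in X.columns if "Fixation" in c and "Duration" in c],
    "Blink Duration": [c for c in X.columns if "Blink" in c and "Duration" in c],
}
X_reduced = pd.DataFrame({name: first_component(X, cols) for name, cols in groups.items()})
```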
We then applied SHAP and LIME to these newly derived features, following the same ET fitting procedure, and thoroughly analyzed the results. The findings corroborated those obtained with the original 49 features, reinforcing the validity and consistency of our initial interpretations.
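Finally, a sketch of a local LIME explanation on the reduced feature set, using the lime package’s LimeTabularExplainer; for brevity it re-fits the ET model on the PCA-derived features without the SMOTE step, so it is an illustrative simplification rather than the exact pipeline.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Re-fit ET on the PCA-derived features for one quadrant category (quadrant 0 here);
# the one-vs-rest target is built as in the preprocessing sketch.
y_binary0 = (df["Quadrant Category"] == 0).astype(int)
X_red_train, X_red_test, y_train, y_test = train_test_split(
    X_reduced, y_binary0, test_size=0.2, stratify=y_binary0, random_state=42
)
model_reduced = ExtraTreesClassifier(random_state=42).fit(X_red_train, y_train)

# Local explanation for one test trial: LIME fits a sparse linear surrogate
# around the instance and reports per-feature weights.
explainer = LimeTabularExplainer(
    X_red_train.values,
    feature_names=list(X_red_train.columns),
    class_names=["other quadrants", "target quadrant"],
    mode="classification",
)
explanation = explainer.explain_instance(
    X_red_test.values[0], model_reduced.predict_proba, num_features=10
)
print(explanation.as_list())  # (feature rule, local weight) pairs
```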