Article

Predicting User Attention States from Multimodal Eye–Hand Data in VR Selection Tasks

1 School of Mechanical Engineering, Southeast University, Nanjing 211189, China
2 School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260, USA
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(10), 2052; https://doi.org/10.3390/electronics14102052
Submission received: 16 April 2025 / Revised: 10 May 2025 / Accepted: 14 May 2025 / Published: 19 May 2025

Abstract

Virtual reality (VR) devices that integrate eye-tracking and hand-tracking technologies can capture users’ natural eye–hand data in real time within a three-dimensional virtual space, providing new opportunities to explore users’ attentional states during natural 3D interactions. This study aims to develop an attention-state prediction model based on the multimodal fusion of eye and hand features, which distinguishes whether users primarily employ goal-directed attention or stimulus-driven attention during the execution of their intentions. In our experiment, we collected three types of data—eye movements, hand movements, and pupil changes—and instructed participants to complete a virtual button selection task. This setup allowed us to establish a binary ground truth label for attentional state during the execution of selection intentions for model training. To investigate the impact of different time windows on prediction performance, we designed eight time windows ranging from 0 to 4.0 s (in increments of 0.5 s) and compared the performance of eleven algorithms, including logistic regression, support vector machine, naïve Bayes, k-nearest neighbors, decision tree, linear discriminant analysis, random forest, AdaBoost, gradient boosting, XGBoost, and neural networks. The results indicate that, within the 3 s window, the gradient boosting model performed best, achieving a weighted F1-score of 0.8835 and an Accuracy of 0.8860. Furthermore, the analysis of feature importance demonstrated that the multimodal eye–hand features play a critical role in the prediction. Overall, this study introduces an innovative approach that integrates three types of multimodal eye–hand behavioral and physiological data within a virtual reality interaction context. This framework provides both theoretical and methodological support for predicting users’ attentional states within short time windows and contributes practical guidance for the design of attention-adaptive 3D interfaces. In addition, the proposed multimodal eye–hand data fusion framework also demonstrates potential applicability in other three-dimensional interaction domains, such as game experience optimization, rehabilitation training, and driver attention monitoring.

1. Introduction

With the rapid advancement of intelligent interaction technologies, human–computer interaction (HCI) has been gradually shifting from the traditional “passive response” model toward a more intelligent, “proactive response” paradigm. Modern intelligent interactive systems no longer merely wait for users’ explicit commands before responding; instead, they increasingly emphasize the real-time, proactive prediction of users’ intentions and cognitive states [1,2]. In recent years, virtual reality (VR) systems have attracted significant attention, thanks to their immersive displays and flexible, natural interactions. Within VR environments, users can fully engage their sensory and motor channels to interact with virtual content; at the same time, VR technologies can implicitly, precisely, and comprehensively track the entire interaction process [3]. These capabilities strongly support the development of real-time adaptive and intelligent interactive systems.
User intention generally refers to the underlying motivations and goals that drive users to act while interacting with a system [4,5]. During intention execution, attention serves as a crucial cognitive mechanism. Attention can be controlled through two main modes as follows [6]: stimulus-driven (bottom-up) attention triggered by salient or novel external stimuli, and goal-directed (top-down) attention guided by prior knowledge, explicit planning, and current objectives. As interfaces serve as carriers of information and functionality in HCI [7], user intentions often induce goal-directed attention to locate relevant interface elements for task completion, while salient interface stimuli may trigger stimulus-driven attention and thereby influence users’ decisions and behaviors. Accordingly, understanding and identifying attentional patterns when users execute their intentions can help designers and developers create more intelligent and adaptive interfaces.
Nonetheless, from an external standpoint, intentions and attention within the human brain resemble a “black box”. Researchers must rely on observable cues to infer the underlying processes and outcomes. Existing studies have explored various forms of data—such as EEG signals [8], eye-tracking measurements [9,10,11], and mouse operations [12]—to analyze and predict users’ cognitive activities, including intention type, areas of interest, and degree of attentional focus. However, most such investigations focus on traditional two-dimensional (2D) interaction scenarios, and with the growing ubiquity of three-dimensional (3D) interfaces, these findings may not generalize to more natural 3D environments. A method to leverage multimodal eye–hand behavior data within controlled 3D virtual environments to elucidate users’ cognitive and decision-making processes remains an area requiring further investigation.
In recent years, VR systems that integrate eye-tracking and gesture-recognition technologies have enabled real-time capture of eye movements [13,14] and hand movements [15,16] within controlled 3D environments. During more natural interactions in the virtual environment—especially when using direct bare-hand input—both eye-movement and hand-movement data are captured in real time. Since the eyes serve as the primary channel of perception and the hands as the main channel of action, eye–hand coordination is considered a key sensorimotor interface between the human brain and the physical world [17]. Eye–hand coordination provides a rich source of information. Accordingly, using natural eye–hand data from virtual environments to infer users’ attentional states during intention execution has become the focus of this study.
Given these considerations, the present study aims to collect multimodal eye–hand data from VR environments, screening for behavioral and pupillary physiological features to construct a predictive model capable of distinguishing between stimulus-driven attention and goal-directed attention. We hope that the proposed attention-state prediction model may offer a valuable perspective on understanding user intentions and attention, thereby fostering the development of more intelligent, proactive, and adaptive VR interfaces (see Figure 1).

2. Related Work

2.1. Intention and Attention

Intention and attention have long been central research topics in cognitive psychology and neuroscience. Their interplay is particularly critical in understanding user behavior prediction and in designing intelligent interfaces. Within the framework of the theory of planned behavior (TPB), intention is defined both as an attempt to perform a specific behavior—distinct from the actual behavior itself [18]—and as a motivational factor influencing user actions [4,5]. It is a key psychological precursor that drives behavior and is significantly reinforced by users’ positive attitudes toward the behavior and their perceived control over it. Attention is a general concept [19], referring to the process by which the brain selects and focuses on certain information while ignoring others. Two primary modes of attention control are widely recognized. One is goal-directed (top-down) attention, which is “endogenous”, guided by prior knowledge, expectations, and current goals. This mode is relatively slow, task-driven, voluntary, and operates in a closed loop [20]. The other is stimulus-driven (bottom-up) attention, which is “exogenous”, triggered by salient or novel features in a visual scene, and characterized as automatic, reflexive, or peripherally cued [21].
Intention and attention are tightly coupled [22,23,24]. On the one hand, intention can guide goal-directed attention by clearly allocating users’ attentional resources according to task objectives. On the other hand, stimulus-driven attention may influence and even alter a user’s intention. Guided search theory [25,26] offers further insight into these mechanisms, suggesting that by assigning different attentional weights to various features, the system can effectively steer users’ visual search patterns toward objects of potential interest. Therefore, in human–computer interaction, attention can be regarded as the bridge linking interface stimuli and user intention.
In contemporary digital information systems, semantic-icon interfaces are widely utilized [27], employing graphics and text [28,29] aligned with users’ existing knowledge and experience [30,31] to convey functional semantics, engage users, and foster immersion [32]. During interactions, users employ a top-down attentional process by establishing goals and requirements, thereby focusing their attention on task-relevant icons to complete specific operations. In contrast, Feature Integration Theory [33,34] posits that when a target stimulus uniquely differs from surrounding distractors in basic visual features—such as color, shape, orientation, or motion—it immediately “pops out” in the visual field and automatically captures attention through a bottom-up mechanism. Extensive research has confirmed that a single, salient color efficiently draws attention [35,36]. Consequently, color-coded button interfaces are particularly effective in dynamic and complex environments—including virtual reality, augmented reality, in-vehicle information systems, and industrial control panels—where they significantly impact task completion time and interface visibility. For example, in motion-rich environments, orange buttons are most readily detected, whereas blue buttons perform relatively poorly [37]; similarly, color-coded buttons in applications such as virtual music keyboards facilitate faster user comprehension and operation [38].
In summary, a comprehensive understanding of both top-down and bottom-up attentional mechanisms, combined with the effective utilization of users’ attentional resources, can facilitate the successful realization of users’ intentions in human–computer interfaces. Accordingly, this study employs color-coded button interfaces and semantic-icon interfaces to systematically elicit and differentiate these two attentional modes, thereby enabling an in-depth analysis of their distinct behavioral characteristics.

2.2. Cognitive-Related Data and Modeling

Eye movement is one of the most informative indicators of human cognitive states. By analyzing eye-tracking data, researchers can infer users’ search patterns, areas of interest, and cognitive load. The application of eye-tracking technology transforms internal cognitive activities into measurable data streams, providing essential clues for cognitive modeling in human–computer interaction. Early work by Yarbus (1967) established that eye movements are regulated top-down by different task demands and that an observer’s intention can be inferred from eye-movement patterns [39]. Building on this foundation, Borji and Itti (2014) successfully classified the seven task types originally defined by Yarbus using eye-tracking data [40]. Jang et al. (2014) classified human implicit navigation and information intentions in a visual search task based on eye-movement patterns and pupillary changes [41]. Joseph et al. (2018) constructed a computational model of cognitive states in a cueing task via dynamic Bayesian networks [42]. Kootstra et al. (2020) further expanded the extracted eye-movement features to build a classifier for decoding users’ cognitive states during search, rating, memory tasks, and task switching [43]. In specialized domains, eye-tracking data have also been shown to be highly effective. For instance, Jiang et al. (2022) classified pilot attentional states based on visual fixation behaviors [11]. Kotseruba et al. (2022) conducted a comprehensive review of methods for modeling and detecting driver attention states using gaze data in driving scenarios [44]. These findings reinforce the reliability of eye-movement features for identifying and predicting various user cognitive states and task types.
Meanwhile, hand-movement data—an equally robust indicator of action intention—have also garnered considerable attention. Studies show that the cursor’s position often aligns closely with the user’s gaze [45]. Raghunath et al. (2012), examining pathologists’ viewing behaviors, demonstrated that mouse cursor movements are spatially coupled with eye fixations, allowing the cursor position to partially predict visual attention patterns [46]. In web-search research, mouse-tracking data serve as a “covert indicator of interest”, revealing which search results users view even without explicit clicks [47]. Hence, hand data reflect not only a user’s overt action intentions but also enrich the dynamic details of the user’s attentional process.
In addition, other data sources have been widely used to model and predict cognitive states, such as sEMG for anticipating limb dynamics and recognizing gestures [48,49,50,51], EEG for decoding movement intentions [52,53], and skeletal data for inferring interactive intentions with 3D models [54]. Recently, multimodal feature fusion has emerged as a promising approach, exemplified by combining EEG and eye-tracking to predict users’ informational and navigational intentions with high accuracy [55].
In summary, a broad range of physiological and behavioral data sources serves as a rich foundation for modeling and predicting user cognitive states. Eye-tracking and manual interaction data, in particular, can be captured directly and are highly informative—providing reliable features for cognitive state prediction. Given that interactions in virtual reality (VR) occur within fully three-dimensional spaces—and that natural, direct bare-hand interaction fundamentally differs from traditional, indirect mouse-based methods—conventional two-dimensional interaction data may be inadequate for capturing users’ attentional distribution and intention expression in immersive environments. Building on prior research and capitalizing on the data acquisition capabilities of VR systems, this study focuses on natural eye–hand behaviors in 3D settings, integrating eye movements, hand movements, and pupil dynamics to predict users’ attentional states.

3. Dataset

The data used to develop the attention-state prediction model for selection intentions in a virtual reality (VR) environment were collected in a VR experiment. This study complied with the Declaration of Helsinki and was approved by the Ethics Committee of the Affiliated Hospital of Southeast University.

3.1. Participants

A total of 30 students (15 male, 15 female) from Southeast University, aged 22–30 years (M = 24.55, SD = 2.18) and with heights ranging from 155 cm to 185 cm, took part in the study. All participants had normal or corrected-to-normal vision, were right-handed, and reported no cognitive or motor impairments. They were unaware of the study’s objectives and had no prior experience with VR devices. After completing the experiment, each participant received monetary compensation.

3.2. Apparatus and Materials

We employed a Varjo XR-3 VR system (Varjo Technologies, Helsinki, Finland), which integrates 200 Hz eye tracking and a 120 Hz Ultraleap hand-tracking module, with a total device weight of 980 g. The headset’s display was set to its highest resolution (39 PPD), providing a dual-eye resolution of 2192 × 1880 (peripheral region) and 1200 × 1200 (central region), a horizontal field of view of 115°, and a refresh rate of 90 Hz. Participants performed the tasks while seated (see Figure 2). To reduce the load from the headset, a suspension roller support was positioned overhead. The experimental program was developed in Unity (version 2021.3.15f1c1) and ran on a high-performance computer equipped with an Nvidia RTX 3090 GPU, an AMD R9-5950X CPU, and 32 GB of RAM to ensure stable system operation.
Two types of virtual interfaces were used in the experiment. Each interface contained nine buttons arranged in a 3 × 3 matrix as follows:
  • Non-semantic, color-coded button interface: This interface consisted of eight blue buttons and one red button, as shown in Figure 3a. Each button position hosted the red button once, resulting in nine unique interfaces.
  • Semantic icon-and-text button interface: This interface contained nine icon-based buttons, each combining a graphic and a text label rendered in white on a blue background (Figure 3b). By permuting the positions of these nine buttons, we generated nine distinct semantic interfaces.

3.3. Experimental Design

This experiment comprises two groups of button selection tasks, each corresponding to one of two distinct types of virtual interfaces.
  • Task 1—stimulus-driven button selection: In each trial, one red button appeared at a random position among the nine buttons on the virtual interface (see Figure 3a). Each red-button position was repeated three times. Participants tapped the red button with the index finger of their virtual right hand, upon which the entire interface vanished immediately.
  • Task 2—goal-directed button selection: Before each trial, the virtual environment displayed a random instruction in Chinese indicating a specific functional requirement (i.e., a target button and its position; see Figure 3b). Each instruction was repeated three times. Participants tapped the corresponding target button with their right index finger in the virtual environment, causing the interface to disappear automatically.
Each task group comprised 9 × 3 = 27 trials, resulting in a total of 54 trials (27 × 2) across both groups. The formal experimental procedure for each group is illustrated in Figure 4. Initially, the virtual interface was placed approximately 450 mm (0.45 m) from the participant’s eyes, with the interface center at a height of 1.2 m (1200 mm) above the floor. Participants adjusted their seat height and orientation so that their eyes were aligned with the crosshair used for calibration (see Figure 2).

3.4. Experimental Procedure

After signing an informed consent form and completing a basic information questionnaire, participants donned the Varjo XR-3 headset and sat in a comfortable position. They first performed practice trials to become familiar with the task. The formal experiment comprised two task groups, completed in a counterbalanced order to mitigate group effects. Before each group, a five-point eye-tracking calibration was conducted. During each trial, once the virtual interface appeared, eye-tracking and hand-movement data (specifically from the right index fingertip) were recorded in real time. Each group took approximately 5–10 min to complete. After finishing one group, participants removed the headset and took a short break. Once ready, they proceeded to the next group. The overall experimental flow is shown in Figure 4, and the entire experiment lasted approximately 20–30 min.

4. Attention State Prediction Model Development

We collected the following three types of data from participants: gaze sampling data, right index fingertip movement data, and pupil-size data. Due to occasional eye-tracking device malfunctions, data from one participant were excluded; as a result, we used the complete data from the remaining 29 participants. We first preprocessed the raw data, extracted 38 features, and, based on the task design, established the ground-truth labels for attention-state classification. Next, we employed a leave-one-out nested cross-validation approach for hyperparameter tuning, model training, and performance evaluation, enabling a comprehensive comparison and analysis of the various models.

4.1. Data Preprocessing

The raw eye-tracking data included each gaze sampling point’s initial coordinates, directional vectors, and pupil-diameter information, while the raw hand-movement data consisted of spatial position information. Both datasets were time-stamped for alignment.
For the eye-tracking data, we used an eye-movement segmentation algorithm [56,57] to classify the raw data into fixations, saccades, and blinks, and we then extracted features related to each event. For the hand-movement data, we applied 30 Hz and 9 Hz low-pass filters in sequence to reduce noise [58], followed by calculating a series of kinematic indicators to represent hand-movement characteristics. Additionally, for pupil-size data, we designated the average pupil diameter measured from 0.0 to 0.5 s after the virtual interface appeared as the baseline [59]. We then recorded subsequent pupil-size values as relative deviations from this baseline, constituting the physiological-feature component.
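As an illustrative sketch of this preprocessing step (not the authors' implementation; the filter order, sampling rate, and array names below are assumptions), the cascaded low-pass filtering of the fingertip trajectory and the pupil baseline correction could be realized with SciPy as follows:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, cutoff_hz, fs, order=4):
    """Zero-phase Butterworth low-pass filter (cutoff in Hz, fs = sampling rate in Hz)."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal, axis=0)

# Placeholder inputs: fingertip positions (N x 3, metres) at an assumed 120 Hz,
# and pupil diameters (mm) with timestamps measured from interface onset.
fs_hand = 120.0
hand_xyz = np.cumsum(np.random.randn(600, 3), axis=0) * 1e-3
pupil = 3.5 + 0.1 * np.random.randn(600)
t_pupil = np.linspace(0.0, 5.0, 600)

# Cascaded 30 Hz and 9 Hz low-pass filtering, then kinematic indicators.
hand_smooth = lowpass(lowpass(hand_xyz, 30.0, fs_hand), 9.0, fs_hand)
velocity = np.gradient(hand_smooth, 1.0 / fs_hand, axis=0)
acceleration = np.gradient(velocity, 1.0 / fs_hand, axis=0)

# Pupil baseline: mean diameter over 0.0-0.5 s after the interface appeared;
# subsequent values are expressed as deviations from this baseline.
baseline = pupil[t_pupil <= 0.5].mean()
pupil_relative = pupil - baseline
```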
During preliminary feature aggregation, we identified several extreme outliers (absolute values exceeding three times the interquartile range). Analysis suggested that these outliers most likely originated from intermittent tracking losses of the Varjo XR-3 device during the experiment. To prevent these outliers from skewing the results, we replaced them with the respective feature median.
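A minimal sketch of this outlier-replacement rule, assuming the aggregated features are held in a pandas DataFrame and interpreting the criterion as the common quartile-fence variant (the exact cutoff convention is not specified in the text):

```python
import pandas as pd

def replace_extreme_outliers(features: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Replace values lying more than k * IQR beyond the quartiles with the column median."""
    cleaned = features.copy()
    for col in cleaned.columns:
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = (cleaned[col] < q1 - k * iqr) | (cleaned[col] > q3 + k * iqr)
        cleaned.loc[outliers, col] = cleaned[col].median()
    return cleaned
```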

4.2. Feature Extraction and Ground Truth

To enable the early detection of users’ attention states during the selection task, we incorporated time-series data into a supervised learning framework and aggregated eye-movement behavior, hand kinematics, and pupil-size changes within different time windows. Each window began at the moment the virtual interface appeared (trial onset) and ended after a specified duration. To validate our choice of window lengths, we first conducted a statistical analysis of completion times for the two button-selection tasks under the two attentional conditions. The mean completion time for top-down attention-driven tasks was 3.45 s (SD = 1.56; range = 1.07–9.47 s), whereas for bottom-up attention-driven tasks it was 2.17 s (SD = 0.69; range = 0.69–4.61 s). A paired t-test confirmed that this difference was statistically significant (t(782) = 22.07, p < 0.001), and a Wilcoxon signed-rank test yielded the same conclusion (V = 276401, p < 0.001). Based on these findings, we defined eight window lengths from 0.5 to 4.0 s in 0.5 s increments, each starting at trial onset. Thereafter, to ensure that every retained trial fully covered its assigned window, we excluded all trials whose actual completion time was shorter than the specified window length, thus guaranteeing the integrity of the feature extraction and model training data.
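The window-length validation could be reproduced along the following lines; the arrays below are placeholders standing in for the paired per-trial completion times, so this is a sketch rather than the actual analysis script:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Placeholder paired arrays: completion times (s) of matched goal-directed (top-down)
# and stimulus-driven (bottom-up) trials; 783 pairs would match the reported t(782).
rng = np.random.default_rng(0)
top_down = rng.gamma(shape=5.0, scale=0.7, size=783)
bottom_up = rng.gamma(shape=9.0, scale=0.25, size=783)

t_stat, p_t = ttest_rel(top_down, bottom_up)
w_stat, p_w = wilcoxon(top_down, bottom_up)   # signed-rank statistic and p-value
print(f"paired t: t = {t_stat:.2f}, p = {p_t:.3g}; Wilcoxon: V = {w_stat:.0f}, p = {p_w:.3g}")

# Candidate window lengths (s); trials shorter than a given window are later discarded.
window_lengths = np.arange(0.5, 4.01, 0.5)
```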
Three categories of features were selected as inputs to the model as follows: eye-movement event features, hand-movement dynamics features, and pupil-size changes. The specific features and their descriptions are presented in Table 1. To mitigate the effect of redundant features on predictive performance, we used mutual information analysis (computing the average result across 50 runs with different random seeds) and correlation analysis to select 24 features with high importance and low inter-feature correlation, forming our final feature set.
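A possible implementation of this two-stage feature screening (mutual information averaged over 50 seeds, followed by a correlation filter) is sketched below; the correlation threshold is an assumption, since the paper reports only the resulting 24-feature set:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def screen_features(X: pd.DataFrame, y: np.ndarray,
                    n_runs: int = 50, corr_thresh: float = 0.9) -> list:
    """Rank features by mutual information averaged over n_runs random seeds, then
    greedily drop the lower-ranked member of any highly correlated feature pair."""
    mi = np.mean(
        [mutual_info_classif(X, y, random_state=seed) for seed in range(n_runs)],
        axis=0,
    )
    ranked = X.columns[np.argsort(mi)[::-1]]
    corr = X.corr().abs()
    kept = []
    for feat in ranked:
        if all(corr.loc[feat, other] < corr_thresh for other in kept):
            kept.append(feat)
    return kept
```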

4.3. Model Development

To determine the best model for predicting users’ attention states during the selection process, we systematically compared various algorithms, including logistic regression (LR), naïve Bayes (NB), decision tree (DT), random forest (RF), linear discriminant analysis (LDA), support vector machine (SVM), k-nearest neighbor (KNN), gradient boosting (GB), AdaBoost, XGBoost, and neural network (NN). Given the complexity and potential challenges of collecting human behavioral data, we used a leave-one-out nested cross-validation method to achieve an unbiased estimate of each model’s generalizability.
As shown in Figure 5, the inner cross-validation employed a 10-fold procedure on data from 28 participants for hyperparameter optimization and initial model selection. The outer cross-validation used a leave-one-out scheme, each time designating the data of one participant as a fully independent test set, ensuring that no test data were used in either training or parameter tuning. We assumed that each participant’s attention-mode distribution approximates that of the overall dataset, thus allowing every participant’s data to appear in the test set exactly once. This method offered a reliable evaluation of model generalization. All modeling steps were implemented in Python (version 3.11.5).
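The nested scheme can be sketched with scikit-learn as follows; the hyperparameter grid shown in the comment is hypothetical, as the paper does not list the exact search spaces:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut, StratifiedKFold

def nested_loocv(X, y, participant_ids, param_grid):
    """Outer leave-one-participant-out loop with an inner 10-fold grid search
    on the remaining participants' data."""
    outer = LeaveOneGroupOut()
    accs, f1s = [], []
    for train_idx, test_idx in outer.split(X, y, groups=participant_ids):
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                              param_grid, cv=inner, scoring="f1_weighted", n_jobs=-1)
        search.fit(X[train_idx], y[train_idx])
        pred = search.best_estimator_.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="weighted"))
    return np.mean(accs), np.mean(f1s)

# Hypothetical usage (the grid below is not the one used in the paper):
# acc, f1 = nested_loocv(X, y, participant_ids,
#                        {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1],
#                         "max_depth": [2, 3]})
```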
For the binary-classification task in this study, Precision, Recall, Accuracy, and F1-score were adopted as the primary evaluation metrics. These metrics are defined as follows (see Formulas (1)–(4)):
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (1)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (2)
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (3)
$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (4)
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Precision reflects the proportion of correctly predicted positive samples among all predicted positives. Recall measures the proportion of actual positive samples that are correctly identified. Accuracy, the simplest and most commonly used metric, indicates the proportion of all samples that are correctly classified, providing an intuitive measure of overall classification performance. The F1-score, as the harmonic mean of Precision and Recall, balances these two aspects into a single value, making it suitable for assessing performance on imbalanced data. Therefore, subsequent model comparisons will primarily report on the Accuracy and F1-score.
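For reference, the weighted variants of these metrics used in the model comparison can be obtained directly from scikit-learn; the label arrays below are placeholders:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels: 1 = goal-directed (top-down), 0 = stimulus-driven (bottom-up).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
print(f"weighted precision={precision:.3f}, recall={recall:.3f}, "
      f"F1={f1:.3f}, accuracy={accuracy:.3f}")
```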

5. Results

To enhance model robustness, we applied leave-one-out nested cross-validation to each model under every time window, thereby obtaining reliable performance estimates. In this section, we first conduct a comprehensive comparison of prediction performance across different models and time windows. Next, by identifying the highest F1-score and Accuracy for each time window, we determine the optimal time window and the corresponding best model, followed by a detailed analysis of this selected window and model. Finally, we discuss how different feature subsets affect prediction performance.

5.1. Comparative Performance Across Models

We used repeated-measures ANOVA to examine how various machine learning algorithms influence the Accuracy and F1-score of predicting attentional states in the selection task. The results show that the choice of algorithm significantly affects both Accuracy (F(10,70) = 33.9034, p < 0.001) and F1-score (F(10,70) = 25.7171, p < 0.001). Figure 6 illustrates the weighted F1-scores and Accuracies of the different models across time windows, together with their corresponding 95% confidence intervals (CIs). In general, NB, DT, KNN, and NN performed slightly worse than the others. Specifically, GB attains the highest point estimates (F1 = 0.834 ± 0.045; Accuracy = 0.838 ± 0.044), although its 95% CIs largely overlap with those of XGBoost, AdaBoost, and RF. RF yields the narrowest CIs (F1 CI = [0.785, 0.861]; Accuracy CI = [0.793, 0.866]), indicating the most stable performance across folds. By contrast, NB records the lowest means (F1 = 0.761 ± 0.063; Accuracy = 0.773 ± 0.053) and the widest intervals, reflecting both lower and more variable performance.
To further elucidate performance differences among models, we conducted pairwise t-tests within each time window and applied a Bonferroni correction (α = 0.05). For the F1-score, AdaBoost, GB, LR, SVM, and XGBoost differ significantly from DT, KNN, and NB (p < 0.05); LDA differs significantly from DT and NB (p < 0.05); and RF differs significantly from DT and KNN (p < 0.05). Regarding Accuracy, AdaBoost, GB, LR, SVM, XGBoost, LDA, and RF all differ significantly from DT, KNN, and NB (p < 0.05), and GB and NN also exhibit a significant difference in Accuracy (p < 0.05).
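A sketch of this statistical comparison, assuming the per-model weighted F1-scores are arranged in a long-format table with one row per (repeated unit, model) pair; the column names and exact pairing unit are assumptions:

```python
import itertools
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

def compare_models(scores: pd.DataFrame) -> pd.DataFrame:
    """scores: long-format table with columns 'unit', 'model', 'f1'
    (one weighted F1 per repeated unit and model)."""
    # Repeated-measures ANOVA across models.
    anova = AnovaRM(scores, depvar="f1", subject="unit", within=["model"]).fit()
    print(anova.anova_table)

    # Pairwise paired t-tests across models with Bonferroni correction.
    wide = scores.pivot(index="unit", columns="model", values="f1")
    pairs = list(itertools.combinations(wide.columns, 2))
    pvals = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
    return pd.DataFrame({"pair": pairs, "p_adj": p_adj, "significant": reject})
```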

5.2. Effects of Different Time Windows

Table 2 lists the models with the highest weighted F1-scores in each time window and their corresponding Accuracies. Overall, except for the 0.5 s window, both the weighted F1-scores and Accuracies exceed 0.80 in the other windows, and their performance remains relatively close. As the window length increases, the weighted F1-score and Accuracy fluctuate overall—rising initially, then declining, and rising again—exhibiting a wave-like pattern in Figure 7.
Specifically, in the 3.0 s time window, GB achieves the highest weighted F1-score (0.8835) and Accuracy (0.8860). The 3.5 s window performs nearly as well at 0.8790 and 0.8839 for weighted F1-score and Accuracy, respectively. Additionally, for the relatively short 1.0 s window, XGBoost stands out, providing locally optimal performance.
For the 3 s window that yielded the highest predictive performance, the Pearson correlation between the actual completion times of all valid trials and their model-predicted probabilities was calculated (r = −0.04, p = 0.40). This non-significant result indicates no systematic linear relationship between trial duration and model output, demonstrating that task duration did not bias feature extraction or model discrimination within this window. Therefore, balancing the need for early prediction of user attentional states with model stability, we ultimately choose the 3.0 s window and GB as the best-performing combination.
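This check amounts to a single Pearson correlation between trial completion times and the model's predicted probabilities; a minimal sketch with placeholder arrays:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder arrays standing in for the valid 3 s-window trials:
# actual completion times (s) and the GB model's predicted class probabilities.
rng = np.random.default_rng(1)
completion_times = rng.uniform(3.0, 9.5, size=500)
predicted_probs = rng.uniform(0.0, 1.0, size=500)

r, p = pearsonr(completion_times, predicted_probs)
print(f"Pearson r = {r:.2f}, p = {p:.2f}")
```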

5.3. Analysis of the Optimal Window and Model

Within the optimal 3 s time window, the model achieves a weighted precision of 89.9% and a weighted recall of 88.6%, indicating substantial completeness and balance in its predictions. The corresponding confusion matrix and ROC curve are shown in Figure 8.
Furthermore, we analyzed the importance of each feature in the complete set, where the total importance scores of all 24 features sum to 1, see Figure 9 and Figure 10. Table 3 lists the top ten features based on their respective scores. We found that maximum hand-movement acceleration, fixation ratio, and average saccade amplitude contributed the most. These were followed by the hand’s average velocity along the Z (depth)-axis, total fixation duration, and the maximum pupil-size change. All of these features play a crucial role in predicting users’ attentional states.
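If the importances are taken from scikit-learn's gradient boosting implementation (an assumption), they are impurity-based and already normalized to sum to 1, so the ranking can be read directly from the fitted model. The snippet below demonstrates this on synthetic data rather than the study's feature set:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def top_features(model, feature_names, k=10):
    """Return the k largest impurity-based importances (they sum to 1 across features)."""
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(k)

# Demonstration on synthetic data with 24 features (the study's feature count).
X, y = make_classification(n_samples=300, n_features=24, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)
print(top_features(gb, [f"feature_{i}" for i in range(24)]))
```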
Following the overall analysis of the optimal model and the full feature set, we further examined the model’s differential performance across the two attentional states. The confusion matrix revealed a clear asymmetry in classification errors as follows: the false positive rate for misclassifying stimulus-driven bottom-up attention as goal-directed top-down attention reached 26.82%, whereas the false negative rate in the reverse direction was only 6.65%. To quantify this disparity, we calculated classification metrics separately for each attentional condition. The results indicated that the model achieved superior predictive performance for top-down attention under the semantic icon interface (Precision = 0.895; Recall = 0.946; F1 = 0.920), while the performance for bottom-up attention under the color-coded interface was relatively lower (Precision = 0.859; Recall = 0.749; F1 = 0.800). Although the overall model Accuracy reached 88.6%, the decision boundaries appeared more prone to confusion when processing bottom-up attention samples.
To further identify the critical features underlying this classification disparity, we computed the mean SHAP values for all test samples within each attentional category and evaluated statistical significance using Welch’s t-test with Benjamini–Hochberg correction, see Table 4. A total of 11 features showed significant differences between the two attentional states (p < 0.05). Among them, peak hand acceleration exhibited the largest positive mean difference (mean_top-down − mean_bottom-up = 0.022), substantially contributing to the prediction of top-down attention. In contrast, mean hand acceleration had the most negative difference (mean_top-down − mean_bottom-up = −0.014), serving as the primary driver for bottom-up predictions. Table 5 lists the top five driving features for each attentional type, ranked by their absolute SHAP mean values, further illustrating the distinct feature dependencies across conditions. Eye-movement event-related features, including total fixation duration, saccade rate, and fixation rate, were key predictors for top-down attention, whereas transition entropy and both average and maximum saccade amplitudes were more important for recognizing bottom-up attention.
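A sketch of this class-wise SHAP contrast, assuming a fitted gradient boosting model and a test-set feature table; the helper name and column labels are illustrative:

```python
import numpy as np
import pandas as pd
import shap
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def shap_class_contrast(gb, X_test: pd.DataFrame, y_test: np.ndarray) -> pd.DataFrame:
    """Mean SHAP value per feature for each attentional class, compared with
    Welch's t-test and Benjamini-Hochberg correction."""
    explainer = shap.TreeExplainer(gb)
    # For a binary sklearn GB model this is assumed to be an (n_samples, n_features) array.
    shap_vals = explainer.shap_values(X_test)
    top_down = shap_vals[y_test == 1]     # 1 = goal-directed (top-down), assumption
    bottom_up = shap_vals[y_test == 0]    # 0 = stimulus-driven (bottom-up), assumption

    pvals = [ttest_ind(top_down[:, j], bottom_up[:, j], equal_var=False).pvalue
             for j in range(shap_vals.shape[1])]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

    return pd.DataFrame({
        "feature": X_test.columns,
        "mean_top_down": top_down.mean(axis=0),
        "mean_bottom_up": bottom_up.mean(axis=0),
        "diff": top_down.mean(axis=0) - bottom_up.mean(axis=0),
        "p_adj": p_adj,
        "significant": reject,
    }).sort_values("diff", key=np.abs, ascending=False)
```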

5.4. Impact of Different Feature Categories on Prediction Results

Using the 3 s window and GB model combination, which achieved the best predictive performance, we divided the features into three subsets—eye-movement behavior, hand-movement dynamics, and pupil-based physiological signals—and again applied leave-one-out nested cross-validation to compare their prediction performance, see Figure 11. The results show that using only eye-movement features yielded relatively high weighted F1 (0.8457) and Accuracy (0.8505), with average saccade amplitude, fixation ratio, and total fixation duration being the most critical features. When using only hand-movement features, performance slightly declined (F1 = 0.7976, Accuracy = 0.8150), where maximum acceleration and average velocity along the Z-axis played key roles. Relying solely on pupil-based physiological features led to a more pronounced drop in performance (F1 = 0.7067, Accuracy = 0.7361), indicating that these signals alone have limited predictive capability.

6. Discussion

6.1. Model Performance Comparison

Previous research on cognitive state modeling has primarily focused on single-modal data sources, particularly eye-tracking [11,40,41,42,43] or electroencephalogram (EEG) signals [61]. For example, in controlled environments, numerous EEG-based studies have reported classification Accuracies exceeding 70% for different attentional states [61], with some achieving up to 96.7% [62]. Among these, SVM and LDA have been the most commonly used and effective traditional methods [61]. Studies using eye-tracking data have also yielded promising results—for instance, Adrian et al. (2024) employed a multilayer perceptron (MLP) model with four eye-tracking features selected through variance analysis, achieving an Accuracy of 81.2% in classifying attention into high, medium, and low levels [63]. By contrast, our study’s model leverages the convenience of collecting eye–hand data in a virtual reality system and incorporates a richer, more multidimensional feature set by integrating eye-movement behavior, hand-movement dynamics, and pupillary physiological changes—thereby demonstrating a stronger multimodal synergy. Within a 3 s time window, the model achieved classification performance comparable to that of EEG- or eye-tracking–based models reported in prior studies. In addition, because our method employs a leave-one-out cross-validation approach across participants, all testing samples are derived from previously unseen users. This setup effectively examines the model’s ability to generalize to new users’ attentional states.
In particular, among the models we examined, ensemble methods based on bagging and boosting exhibit noticeably superior performance. One likely reason is that individual models may struggle to capture the complex nonlinear relationships and potential noise in the data. In addition, different models may offer advantages in extracting distinct patterns from the dataset. By integrating multiple models, the approach can capitalize on their respective strengths, thus achieving more robust prediction performance. These results underscore the effectiveness of using ensemble methods rather than single models, especially in complex domains such as human behavior data. However, improvements in predictive performance often come at the cost of interpretability. Compared with intrinsically transparent linear models—such as logistic regression and linear discriminant analysis—ensemble methods are considerably more “black-box”. To address this trade-off, once GB was identified as the best-performing model, a two-tier interpretability analysis was undertaken. First, global feature-importance rankings and cumulative contribution curves quantified the relative influence of each predictor. Second, SHAP (SHapley Additive exPlanations) values were employed to uncover class-specific decision patterns, revealing the key features on which the model differentially relies when discriminating between top-down and bottom-up attentional states. This workflow preserves the high predictive Accuracy of the GB model while providing clear, intuitive insight into the mechanisms by which it distinguishes the two attentional categories.

6.2. Effect of Time-Window Size

Time-window selection has a significant and intricate influence on model performance. Our analysis indicates that this effect is not simply linear. First, as the time window length increases, the amount of data meeting the corresponding duration requirement gradually decreases, directly influencing the model’s stability and generalizability. Second, while a longer window can capture more information, it may also introduce additional noise or redundant data that dilute the truly critical information, thereby reducing prediction accuracy. Third, longer windows smooth out rapid fluctuations and diminish the sensitivity to key events. This effect is particularly relevant for capturing transient characteristics such as physiological signals (e.g., pupil-size changes), which often respond promptly to shifts in cognitive load. Existing research shows that these signals are more prominent within shorter windows [64,65].
Our study finds that a 3 s time window strikes the best balance between predictive performance and practical utility. Although the optimal window size may vary across specific scenarios, our findings offer practical guidance for real-time attention-state detection in similar task environments.

6.3. Feature Importance

In analyzing feature contributions, we found that eye-movement features alone still yielded high predictive performance (weighted F1 = 0.8457; Accuracy = 0.8505). This result underscores the effectiveness of eye-movement data in identifying users’ attentional states and aligns with prior research, which has consistently verified eye-movement features as key indicators of attention [66,67].
By comparison, using only hand-movement features produced slightly lower but still robust performance (weighted F1 = 0.7976; Accuracy = 0.8150). Among these, the maximum acceleration and average velocity in the depth dimension emerged as critical features. Although hand-movement data alone are somewhat less predictive than eye-movement data, they retain significant predictive value, particularly given the tight coupling between movement-control strategies that reflect user intentions and eye-movement patterns [68,69].
When only pupil-based physiological features were used, model performance declined more noticeably (weighted F1 = 0.7067; Accuracy = 0.7361). This finding implies that relying solely on pupil changes may be insufficient for capturing the complex dynamics of attentional states. Nevertheless, as direct indicators of physiological responses, pupil features can offer considerable complementary value when combined with eye-movement and hand-movement features.

6.4. Differences in Eye–Hand Coordination Across Attentional States

The optimal GB model established in this study exhibited a significant disparity in its predictive performance across the two attentional states. While it demonstrated nearly unbiased recognition of top-down attention, its recall for bottom-up attention was only 0.75, accompanied by a notably higher false positive rate. This discrepancy may be attributed to the inherent nature of the two attentional modes, outlined as follows: Goal-directed top-down attention is typically characterized by prolonged fixations and well-coordinated eye–hand actions, which tend to remain stable within the 3 s analysis window, thus making them more easily captured by the model. In contrast, bottom-up attention is elicited by salient external stimuli and is associated with more transient and stochastic eye–hand responses. The partial overlap between the feature distributions of the two states increases the risk of misclassification. These findings suggest that although the GB framework in our study achieves strong overall discriminative performance, it may be limited in handling highly dynamic perceptual–motor patterns.
SHAP analysis further elucidated the feature-level basis of this performance asymmetry. The model’s predictions of top-down attention heavily relied on indicators of oculomotor stability (e.g., total fixation duration) and peak acceleration of hand movement, indicating a likely user strategy of prolonged gaze followed by rapid hand movement in semantic-icon selection tasks. In contrast, bottom-up attention predictions depended more on sustained hand acceleration metrics (e.g., average acceleration) and high gaze transition entropy, suggesting a reactive strategy involving rapid visual scanning and concurrent hand motion in response to salient stimuli. This functional dissociation supports the variability and complexity of eye–hand coordination mechanisms under different attentional contexts. Specifically, in goal-directed interaction scenarios, gaze serves as the dominant perceptual channel providing continuous visual guidance, while hand movements execute rapid actions upon target acquisition. Conversely, stimulus-driven interactions feature ongoing hand movements coupled with exploratory saccades. Notably, the opposing SHAP contributions of average and peak hand acceleration in distinguishing between attentional states underscore the critical role of natural hand movement kinematics in predicting users’ internal cognitive and decision-making processes within 3D environments. This provides a valuable reference for multimodal feature engineering in user behavior modeling.

7. Implications and Limitations

7.1. Implications

This study provides empirical evidence for the relationship between natural eye–hand behavior and underlying cognitive states in a virtual reality environment. The results indicate that, even in a relatively straightforward VR button-selection task, users’ eye–hand coordination can yield rich and predictive information about their attentional states. These findings further validate the feasibility of inferring subtle internal cognitive states by leveraging machine learning techniques applied to externally measurable behaviors and physiological signals. They also lay a foundation for future work exploring more nuanced cognitive phenomena in complex, dynamic interactive environments with lightweight external sensors.
In addition, our eye–hand data-driven attentional mode prediction framework holds promise for achieving more refined and intuitive context awareness in future VR interface designs. By predicting users’ attentional states during the execution of selection intentions, this model enables the development of attention-adaptive interfaces that dynamically respond to users’ levels of attention. Specifically, interaction systems based on attentional-state predictions can minimize or delay notifications that are only loosely related to the primary task when users are deeply engaged in top-down goal-oriented tasks. When necessary, the system can also deliver salient contextual cues to reduce the cognitive load associated with top-down attention. Furthermore, based on information priority, it can forcibly capture the user’s attention through prominent stimuli during critical moments.
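As a purely illustrative example of such an attention-adaptive policy (not part of the study; all names and priority levels are hypothetical), a notification router driven by the predicted attentional state might look like this:

```python
from dataclasses import dataclass

@dataclass
class Notification:
    message: str
    priority: int  # hypothetical scale: 0 = low, 1 = normal, 2 = safety-critical

def route_notification(note: Notification, predicted_state: str) -> str:
    """Hypothetical policy: defer low-priority notifications while the user is in a
    goal-directed (top-down) state; always surface safety-critical cues saliently."""
    if note.priority >= 2:
        return "render_salient"   # forcibly capture attention at critical moments
    if predicted_state == "top_down" and note.priority == 0:
        return "defer"            # avoid interrupting focused, goal-directed work
    return "render_subtle"        # otherwise show a low-salience contextual cue

# Example: a low-priority message arriving while the model predicts top-down attention.
print(route_notification(Notification("New achievement unlocked", 0), "top_down"))
```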

7.2. Limitations

Although this study has achieved preliminary success in predicting users’ attentional states, it remains an initial effort with certain limitations that warrant further investigation and refinement.
On one hand, our approach uses snapshot data from fixed time windows as input to the model. Although this method performs well in our task, the number of data samples meeting the time window requirement decreases as the window lengthens due to experimental design constraints. While the “one-in-ten” rule of thumb in machine learning suggests that each feature should have at least ten corresponding data samples [70], the scale of the data still influences model generalizability. In addition, the relatively small sample size in this study (n = 29) may introduce optimistic bias in model performance. Although nested cross-validation offers a relatively unbiased estimate of generalization error, when the number of participants is small relative to the feature dimensionality, even carefully tuned models may inadvertently fit noise rather than the true signal. Future work should consider collecting larger datasets for modeling or exploring deep learning algorithms to predict users’ attentional states.
On the other hand, our data are derived from a relatively simple virtual button-selection task, making our model particularly suited to interactions in which attentional states remain relatively stable throughout the process. In real-world scenarios, however, tasks are often more complex, and attentional states can exhibit rapid and continuous fluctuations. Future research should focus on tasks that more closely approximate real-world conditions to better examine the model’s effectiveness in capturing dynamic attentional shifts. Employing advanced algorithms capable of modeling continuous temporal changes may further enhance predictive performance. In addition, eye- and hand-tracking modules embedded in VR headsets are susceptible to calibration drift after prolonged use or across multiple sessions. Because each participant in the present study completed all tasks within a single experimental session, the stability of the resulting user model under cross-day or multi-session conditions has not yet been assessed. Future studies should therefore implement rigorous tracker calibration before each session and adopt data-collection protocols that span multiple days and sessions, thereby enabling a systematic evaluation of the model’s generalizability and practical reliability.

8. Conclusions

In this study, we utilized natural eye–hand data within a virtual reality setting to build a gradient boosting–based classification model for predicting users’ attentional states while executing selection intentions. We evaluated different time windows, model types, and hyperparameter combinations; our results indicate that in the 0.0–3.0 s window, the gradient boosting algorithm performed best, achieving a weighted F1-score of 0.8835 and an Accuracy of 0.8860.
The main contributions of this study are as follows. We systematically constructed and analyzed a multimodal feature set comprising eye movement, hand movement, and pupil-based physiological signals, thus providing a valuable reference for feature engineering in similar scenarios. The developed model can inform the design of attention-adaptive interactive interfaces, thereby enhancing the efficiency and user experience in 3D virtual environments. In addition, the proposed multimodal eye–hand data fusion framework exhibits strong transferability and can be extended to other 3D interaction settings, including game-experience optimization, rehabilitation training systems, and driver attention detection.

Author Contributions

Conceptualization, X.D.; methodology, X.D. and L.J.; validation, X.D., J.W. and L.J.; formal analysis, X.D.; investigation, X.D.; resources, L.J.; data curation, X.D.; writing—original draft preparation, X.D.; writing—review and editing, X.D., J.W., X.T., and X.L.; visualization, X.D. and L.J.; supervision, C.X.; project administration, L.J.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant numbers 72271053, 52275238, and 71901061; and China’s Ministry of Education Project of Humanities and Social Sciences under Grant number 23YJCZH168.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Southeast University (protocol code 2024ZDSYLL257-P01; approval date: 9 July 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets used in this study are part of an ongoing research project and are therefore not available for public sharing. Any inquiries regarding data access should be directed to the corresponding author.

Acknowledgments

We would like to extend our sincere gratitude to the colleagues and experts who provided invaluable insights and feedback during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schmidt, A. Implicit human computer interaction through context. Pers. Technol. 2000, 4, 191–199. [Google Scholar] [CrossRef]
  2. Karaman, Ç.Ç.; Sezgin, T.M. Gaze-based predictive user interfaces: Visualizing user intentions in the presence of uncertainty. Int. J. Hum.-Comput. Stud. 2018, 111, 78–91. [Google Scholar] [CrossRef]
  3. Baker, C.; Fairclough, S.H. Adaptive virtual reality. In Current Research in Neuroadaptive Technology; Elsevier: Amsterdam, The Netherlands, 2022; pp. 159–176. [Google Scholar]
  4. Ajzen, I. Understanding Attitudes and Predicting Social Behavior; Prentice-Hall: Englewood Cliffs, NJ, USA, 1980. [Google Scholar]
  5. Ajzen, I. The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 1991, 50, 179–211. [Google Scholar] [CrossRef]
  6. Katsuki, F.; Constantinidis, C. Bottom-up and top-down attention: Different processes and overlapping neural systems. Neuroscientist 2014, 20, 509–521. [Google Scholar] [CrossRef]
  7. Du, X.; Yu, M.; Zhang, Z.; Tong, M.; Zhu, Y.; Xue, C. A Task-and Role-Oriented Design Method for Multi-User Collaborative Interfaces. Sensors 2025, 25, 1760. [Google Scholar] [CrossRef]
  8. Kang, J.S.; Park, U.; Gonuguntla, V.; Veluvolu, K.C.; Lee, M. Human implicit intent recognition based on the phase synchrony of EEG signals. Pattern Recognit. Lett. 2015, 66, 144–152. [Google Scholar] [CrossRef]
  9. Castner, N.; Geßler, L.; Geisler, D.; Hüttig, F.; Kasneci, E. Towards expert gaze modeling and recognition of a user’s attention in realtime. Procedia Comput. Sci. 2020, 176, 2020–2029. [Google Scholar] [CrossRef]
  10. Lochbihler, A.; Wallace, B.; Van Benthem, K.; Herdman, C.; Sloan, W.; Brightman, K.; Goubran, R.; Knoefel, F.; Marshall, S. Metrics in a Dynamic Gaze Environment. In Proceedings of the 2024 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Eindhoven, The Netherlands, 26–28 June 2024; pp. 1–6. [Google Scholar]
  11. Jiang, G.; Chen, H.; Wang, C.; Zhou, G.; Raza, M. Analysis of Flight Attention State Based on Visual Gaze Behavior. In Proceedings of the International Conference on Multi-Modal Information Analytics, Hohhot, China, 22–23 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 942–950. [Google Scholar]
  12. Huang, J.; White, R.; Buscher, G. User see, user point: Gaze and cursor alignment in web search. In Proceedings of the Sigchi Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 1341–1350. [Google Scholar]
  13. Rappa, N.A.; Ledger, S.; Teo, T.; Wai Wong, K.; Power, B.; Hilliard, B. The use of eye tracking technology to explore learning and performance within virtual reality and mixed reality settings: A scoping review. Interact. Learn. Environ. 2022, 30, 1338–1350. [Google Scholar] [CrossRef]
  14. Clay, V.; König, P.; Koenig, S. Eye tracking in virtual reality. J. Eye Mov. Res. 2019, 12, 10–16910. [Google Scholar] [CrossRef]
  15. Wozniak, P.; Vauderwange, O.; Mandal, A.; Javahiraly, N.; Curticapean, D. Possible applications of the LEAP motion controller for more interactive simulated experiments in augmented or virtual reality. In Proceedings of the Optics Education and Outreach IV, SPIE, San Diego, CA, USA, 28 August–1 September 2016; Volume 9946, pp. 234–245. [Google Scholar]
  16. Scheggi, S.; Meli, L.; Pacchierotti, C.; Prattichizzo, D. Touch the virtual reality: Using the leap motion controller for hand tracking and wearable tactile devices for immersive haptic rendering. In Proceedings of the ACM SIGGRAPH 2015 Posters, Los Angeles, CA, USA, 9–13 August 2015; p. 1. [Google Scholar]
  17. Cariani, P.A. On the Design of Devices with Emergent Semantic Functions. Ph.D. Thesis, State University of New York Binghamton, Binghamton, NY, USA, 1989. [Google Scholar]
  18. Ajzen, I. From intentions to actions: A theory of planned behavior. In Action Control: From Cognition to Behavior; Springer: Berlin/Heidelberg, Germany, 1985. [Google Scholar]
  19. Borji, A.; Itti, L. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 185–207. [Google Scholar] [CrossRef]
  20. Itti, L.; Koch, C. Computational modelling of visual attention. Nat. Rev. Neurosci. 2001, 2, 194–203. [Google Scholar] [CrossRef] [PubMed]
  21. Egeth, H.E.; Yantis, S. Visual attention: Control, representation, and time course. Annu. Rev. Psychol. 1997, 48, 269–297. [Google Scholar] [CrossRef]
  22. Lau, H.C.; Rogers, R.D.; Haggard, P.; Passingham, R.E. Attention to intention. Science 2004, 303, 1208–1210. [Google Scholar] [CrossRef] [PubMed]
  23. Boussaoud, D. Attention versus intention in the primate premotor cortex. Neuroimage 2001, 14, S40–S45. [Google Scholar] [CrossRef]
  24. Castiello, U. Understanding other people’s actions: Intention and attention. J. Exp. Psychol. Hum. Percept. Perform. 2003, 29, 416. [Google Scholar] [CrossRef]
  25. Wolfe, J.M.; Cave, K.R.; Franzel, S.L. Guided search: An alternative to the feature integration model for visual search. J. Exp. Psychol. Hum. Percept. Perform. 1989, 15, 419. [Google Scholar] [CrossRef]
  26. Wolfe, J.M. Guided Search 6.0: An updated model of visual search. Psychon. Bull. Rev. 2021, 28, 1060–1092. [Google Scholar] [CrossRef]
  27. Shen, I.C.; Cherng, F.Y.; Igarashi, T.; Lin, W.C.; Chen, B.Y. EvIcon: Designing High-Usability Icon with Human-in-the-loop Exploration and IconCLIP. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2023; Volume 42, p. e14924. [Google Scholar]
  28. Hou, G.; Hu, Y. Designing combinations of pictogram and text size for icons: Effects of text size, pictogram size, and familiarity on older adults’ visual search performance. Hum. Factors 2023, 65, 1577–1595. [Google Scholar] [CrossRef]
  29. Reijnen, E.; Vogt, L.L.; Fiechter, J.P.; Kühne, S.J.; Meister, N.; Venzin, C.; Aebersold, R. Well-designed medical pictograms accelerate search. Appl. Ergon. 2022, 103, 103799. [Google Scholar] [CrossRef]
  30. Xie, J.; Unnikrishnan, D.; Williams, L.; Encinas-Oropesa, A.; Mutnuri, S.; Sharma, N.; Jeffrey, P.; Zhu, B.; Lighterness, P. Influence of domain experience on icon recognition and preferences. Behav. Inf. Technol. 2022, 41, 85–95. [Google Scholar] [CrossRef]
  31. Ding, Y.; Naber, M.; Paffen, C.; Gayet, S.; Van der Stigchel, S. How retaining objects containing multiple features in visual working memory regulates the priority for access to visual awareness. Conscious. Cogn. 2021, 87, 103057. [Google Scholar] [CrossRef] [PubMed]
  32. Alebri, M.; Costanza, E.; Panagiotidou, G.; Brumby, D.P.; Althani, F.; Bovo, R. Visualisations with semantic icons: Assessing engagement with distracting elements. Int. J. Hum.-Comput. Stud. 2024, 191, 103343. [Google Scholar] [CrossRef]
  33. Treisman, A.M.; Gelade, G. A feature-integration theory of attention. Cogn. Psychol. 1980, 12, 97–136. [Google Scholar] [CrossRef] [PubMed]
  34. Quinlan, P.T. Visual feature integration theory: Past, present, and future. Psychol. Bull. 2003, 129, 643. [Google Scholar] [CrossRef]
  35. Bacon, W.F.; Egeth, H.E. Overriding stimulus-driven attentional capture. Percept. Psychophys. 1994, 55, 485–496. [Google Scholar] [CrossRef]
  36. Belopolsky, A.V.; Zwaan, L.; Theeuwes, J.; Kramer, A.F. The size of an attentional window modulates attentional capture by color singletons. Psychon. Bull. Rev. 2007, 14, 934–938. [Google Scholar] [CrossRef]
  37. Yamin, P.A.; Park, J.; Kim, H.K.; Hussain, M. Effects of button colour and background on augmented reality interfaces. Behav. Inf. Technol. 2024, 43, 663–676. [Google Scholar] [CrossRef]
  38. Milne, A.J. Hex Player—A Virtual Musical Controller. In Proceedings of the International Conference on New Interfaces for Musical Expression, Oslo, Norway, 30 May–1 June 2011; pp. 244–247. [Google Scholar]
  39. Yarbus, A.L. Eye movements during perception of complex objects. In Eye Movements and Vision; Springer: Boston, MA, USA, 1967; pp. 171–211. [Google Scholar]
  40. Borji, A.; Itti, L. Defending Yarbus: Eye movements reveal observers’ task. J. Vis. 2014, 14, 29. [Google Scholar] [CrossRef]
  41. Jang, Y.M.; Mallipeddi, R.; Lee, S.; Kwak, H.W.; Lee, M. Human intention recognition based on eyeball movement pattern and pupil size variation. Neurocomputing 2014, 128, 421–432. [Google Scholar] [CrossRef]
  42. Joseph MacInnes, W.; Hunt, A.R.; Clarke, A.D.; Dodd, M.D. A generative model of cognitive state from task and eye movements. Cogn. Comput. 2018, 10, 703–717. [Google Scholar] [CrossRef]
  43. Kootstra, T.; Teuwen, J.; Goudsmit, J.; Nijboer, T.; Dodd, M.; Van der Stigchel, S. Machine learning-based classification of viewing behavior using a wide range of statistical oculomotor features. J. Vis. 2020, 20, 1. [Google Scholar] [CrossRef] [PubMed]
  44. Kotseruba, I.; Tsotsos, J.K. Attention for vision-based assistive and automated driving: A review of algorithms and datasets. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19907–19928. [Google Scholar] [CrossRef]
  45. Huang, J.; White, R.W.; Dumais, S. No clicks, no problem: Using cursor movements to understand and improve search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vancouver, BC, Canada, 7–12 May 2011; pp. 1225–1234. [Google Scholar]
  46. Raghunath, V.; Braxton, M.O.; Gagnon, S.A.; Brunyé, T.T.; Allison, K.H.; Reisch, L.M.; Weaver, D.L.; Elmore, J.G.; Shapiro, L.G. Mouse cursor movement and eye tracking data as an indicator of pathologists’ attention when viewing digital whole slide images. J. Pathol. Inform. 2012, 3, 43. [Google Scholar] [CrossRef] [PubMed]
  47. Goecks, J.; Shavlik, J. Learning users’ interests by unobtrusively observing their normal behavior. In Proceedings of the 5th International Conference on Intelligent User Interfaces, New Orleans, LA, USA, 9–12 January 2000; pp. 129–132. [Google Scholar]
  48. Xu, H.; Xiong, A. Advances and disturbances in sEMG-based intentions and movements recognition: A review. IEEE Sens. J. 2021, 21, 13019–13028. [Google Scholar] [CrossRef]
  49. He, P.; Jin, M.; Yang, L.; Wei, R.; Liu, Y.; Cai, H.; Liu, H.; Seitz, N.; Butterfass, J.; Hirzinger, G. High performance DSP/FPGA controller for implementation of HIT/DLR dexterous robot hand. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’04), New Orleans, LA, USA, 26 April–1 May 2004; Volume 4, pp. 3397–3402. [Google Scholar]
  50. Zhang, D.; Chen, X.; Li, S.; Hu, P.; Zhu, X. EMG controlled multifunctional prosthetic hand: Preliminary clinical study and experimental demonstration. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 4670–4675. [Google Scholar]
  51. Zhang, H.; Zhao, Z.; Yu, Y.; Gui, K.; Sheng, X.; Zhu, X. A feasibility study on an intuitive teleoperation system combining IMU with sEMG sensors. In Proceedings of the Intelligent Robotics and Applications: 11th International Conference, ICIRA 2018, Newcastle, NSW, Australia, 9–11 August 2018; Proceedings, Part I 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 465–474. [Google Scholar]
  52. Buerkle, A.; Eaton, W.; Lohse, N.; Bamber, T.; Ferreira, P. EEG based arm movement intention recognition towards enhanced safety in symbiotic Human-Robot Collaboration. Robot. Comput.-Integr. Manuf. 2021, 70, 102137. [Google Scholar] [CrossRef]
  53. Schreiber, M.A.; Trkov, M.; Merryweather, A. Influence of frequency bands in eeg signal to predict user intent. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; pp. 1126–1129. [Google Scholar]
  54. Baruah, M.; Banerjee, B.; Nagar, A.K. Intent prediction in human–human interactions. IEEE Trans. Hum.-Mach. Syst. 2023, 53, 458–463. [Google Scholar] [CrossRef]
  55. Sharma, M.; Chen, S.; Müller, P.; Rekrut, M.; Krüger, A. Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User Prediction. In Proceedings of the 25th International Conference on Multimodal Interaction, Paris, France, 9–13 October 2023; pp. 345–354. [Google Scholar]
  56. Mathis, F.; Williamson, J.; Vaniea, K.; Khamis, M. Rubikauth: Fast and secure authentication in virtual reality. In Proceedings of the Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–9. [Google Scholar]
  57. Agtzidis, I.; Startsev, M.; Dorr, M. Smooth pursuit detection based on multiple observers. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, Charleston, SC, USA, 14–17 March 2016; pp. 303–306. [Google Scholar]
  58. Meyer, D.E.; Abrams, R.A.; Kornblum, S.; Wright, C.E.; Keith Smith, J. Optimality in human motor performance: Ideal control of rapid aimed movements. Psychol. Rev. 1988, 95, 340. [Google Scholar] [CrossRef]
  59. Galazka, M.A.; Åsberg Johnels, J.; Zürcher, N.R.; Hippolyte, L.; Lemonnier, E.; Billstedt, E.; Gillberg, C.; Hadjikhani, N. Pupillary contagion in autism. Psychol. Sci. 2019, 30, 309–315. [Google Scholar] [CrossRef]
  60. Krejtz, K.; Duchowski, A.; Szmidt, T.; Krejtz, I.; González Perilli, F.; Pires, A.; Vilaro, A.; Villalobos, N. Gaze transition entropy. ACM Trans. Appl. Percept. (TAP) 2015, 13, 1–20. [Google Scholar] [CrossRef]
  61. Sun, Q.; Zhou, Y.; Gong, P.; Zhang, D. Attention Detection Using EEG Signals and Machine Learning: A Review. Mach. Intell. Res. 2025, 22, 219–238. [Google Scholar] [CrossRef]
  62. Acı, Ç.İ.; Kaya, M.; Mishchenko, Y. Distinguishing mental attention states of humans via an EEG-based passive BCI using machine learning methods. Expert Syst. Appl. 2019, 134, 153–166. [Google Scholar] [CrossRef]
  63. Vulpe-Grigorasi, A.; Kren, Z.; Slijepčević, D.; Schmied, R.; Leung, V. Attention performance classification based on eye tracking and machine learning. In Proceedings of the 2024 IEEE 17th International Scientific Conference on Informatics (Informatics), Poprad, Slovakia, 13–15 November 2024; pp. 431–435. [Google Scholar]
  64. Du, N.; Zhou, F.; Pulver, E.M.; Tilbury, D.M.; Robert, L.P.; Pradhan, A.K.; Yang, X.J. Predicting driver takeover performance in conditionally automated driving. Accid. Anal. Prev. 2020, 148, 105748. [Google Scholar] [CrossRef] [PubMed]
  65. Kramer, S.E.; Lorens, A.; Coninx, F.; Zekveld, A.A.; Piotrowska, A.; Skarzynski, H. Processing load during listening: The influence of task characteristics on the pupil response. Lang. Cogn. Process. 2013, 28, 426–442. [Google Scholar] [CrossRef]
  66. Rayner, K. The 35th Sir Frederick Bartlett Lecture: Eye movements and attention in reading, scene perception, and visual search. Q. J. Exp. Psychol. 2009, 62, 1457–1506. [Google Scholar] [CrossRef]
  67. Castelhano, M.S.; Rayner, K. Eye movements during reading, visual search, and scene perception: An overview. In Cognitive and Cultural Influences on Eye Movements; Tianjin People’s Publishing House: Tianjin, China, 2023; pp. 3–34. [Google Scholar]
  68. Binsted, G.; Chua, R.; Helsen, W.; Elliott, D. Eye–hand coordination in goal-directed aiming. Hum. Mov. Sci. 2001, 20, 563–585. [Google Scholar] [CrossRef]
  69. Smith, B.A.; Ho, J.; Ark, W.; Zhai, S. Hand eye coordination patterns in target selection. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, Palm Beach Gardens, FL, USA, 6–8 November 2000; pp. 117–122. [Google Scholar]
  70. Chowdhury, M.Z.I.; Turin, T.C. Variable selection strategies and its importance in clinical prediction modelling. Fam. Med. Community Health 2020, 8, e000262. [Google Scholar] [CrossRef]
Figure 1. Research Background.
Figure 2. Experimental setup.
Figure 3. Experimental materials.
Figure 4. Individual trial procedure and overall experimental procedure.
Figure 5. Modeling process.
Figure 6. Comprehensive performance of different machine learning models across various time windows (LR = Logistic Regression; NB = Naïve Bayes; DT = Decision Tree; LDA = Linear Discriminant Analysis; SVM = Support Vector Machine; KNN = K-Nearest Neighbors; RF = Random Forest; GB = Gradient Boosting; Ada = AdaBoost; XG = XGBoost; and NN = Neural Network).
Figure 7. F1-scores and Accuracy of different algorithm models across various time windows.
Figure 8. Receiver Operating Characteristic (ROC) curve and confusion matrix of the best model.
Figure 9. Feature cumulative importance curve.
Figure 10. Feature importance by category.
Figure 11. Prediction F1-score and Accuracy of gradient boosting using different feature sets for the 3 s window.
Table 1. Feature category and description.

Feature Category | Subcategory | Feature Description | Unit | Count | Selected Count
Eye movement | Fixation behavior | Fixation count | N | 1 | –
 | | Fixation rate | N/s | 1 | 1
 | | Fixation duration (Average, Maximum, Total) | s | 3 | 2
 | Saccadic behavior | Saccade count | N | 1 | –
 | | Saccade rate | N/s | 1 | 1
 | | Saccade speed (Average, Maximum) | deg/s | 2 | 1
 | | Saccade acceleration (Average, Maximum) | deg/s² | 2 | 1
 | | Saccade amplitude (Average, Maximum) | deg | 2 | 2
 | Blink behavior | Blink count | N | 1 | –
 | | Blink rate | N/s | 1 | 1
 | Area of interest | Transition entropy [60] | – | 1 | 1
 | | Stationary entropy [60] | – | 1 | 1
Hand dynamics | Motion speed | Overall velocity (Average, Maximum) | m/s | 2 | 2
 | | Velocity in X, Y, Z directions (Average, Maximum) | m/s | 6 | 3
 | Motion acceleration | Overall acceleration (Average, Maximum) | m/s² | 2 | 2
 | | Acceleration in X, Y, Z directions (Average, Maximum) | m/s² | 6 | 3
 | Others | Motion peak count | N | 1 | 1
Pupil signals | Pupil changes | Pupil diameter change (Average, Maximum) | mm | 2 | 2
 | | Pupil–iris ratio change (Average, Maximum) | – | 2 | –
Total | | | | 38 | 24
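For readers who wish to reproduce features of this kind, the following Python sketch illustrates how two of the area-of-interest measures in Table 1 (stationary and transition entropy, following [60]) and a simple fixation rate could be computed from the fixation sequence of one time window. The AOI labels, the 3 s window length, and the use of observed AOI proportions as the stationary distribution are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: stationary/transition entropy and fixation rate for one window.
import numpy as np
from collections import Counter

def stationary_and_transition_entropy(aoi_sequence):
    """Stationary and gaze transition entropy (in bits) of a fixation AOI sequence."""
    aois = sorted(set(aoi_sequence))
    index = {a: i for i, a in enumerate(aois)}
    n = len(aois)

    # Stationary distribution approximated by observed AOI proportions (every AOI
    # in `aois` occurs at least once, so all probabilities are strictly positive).
    counts = Counter(aoi_sequence)
    pi = np.array([counts[a] for a in aois], dtype=float)
    pi /= pi.sum()
    h_stationary = -float(np.sum(pi * np.log2(pi)))

    # First-order transition probabilities between consecutive fixations.
    trans = np.zeros((n, n))
    for src, dst in zip(aoi_sequence[:-1], aoi_sequence[1:]):
        trans[index[src], index[dst]] += 1
    row_sums = trans.sum(axis=1, keepdims=True)
    p = np.divide(trans, row_sums, out=np.zeros_like(trans), where=row_sums > 0)

    h_transition = 0.0
    for i in range(n):
        nz = p[i] > 0
        h_transition -= pi[i] * float(np.sum(p[i, nz] * np.log2(p[i, nz])))
    return h_stationary, h_transition

# Hypothetical AOI labels of successive fixations inside one 3 s window.
window_aois = ["target", "distractor", "target", "target", "background", "target"]
h_s, h_t = stationary_and_transition_entropy(window_aois)
fixation_rate = len(window_aois) / 3.0  # fixations per second (N/s) in a 3 s window
print(f"stationary entropy = {h_s:.3f} bits, transition entropy = {h_t:.3f} bits, "
      f"fixation rate = {fixation_rate:.2f} /s")
```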
Table 2. Highest F1-score model and Accuracy for different time windows.

Time Window | Model | Weighted F1-Score | Accuracy
0.5 s | RF | 0.7265 | 0.7368
1.0 s | XG | 0.8669 | 0.8690
1.5 s | XG | 0.8444 | 0.8476
2.0 s | LR | 0.8065 | 0.8097
2.5 s | XG | 0.8482 | 0.8500
3.0 s | GB | 0.8835 | 0.8860
3.5 s | LDA | 0.8790 | 0.8839
4.0 s | GB | 0.8657 | 0.8676
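A comparison of the kind summarized in Table 2 can be approximated with a scikit-learn loop of the following form. This is a minimal sketch on synthetic data with default hyperparameters and an assumed 80/20 split; it is not the authors' pipeline and only shows how weighted F1 and Accuracy would be obtained per window and model.

```python
# Sketch: score several classifiers per time-window feature matrix (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

# Stand-in feature matrices for three window lengths (24 selected features each).
windows = {
    "1.0 s": make_classification(n_samples=400, n_features=24, random_state=1),
    "2.0 s": make_classification(n_samples=400, n_features=24, random_state=2),
    "3.0 s": make_classification(n_samples=400, n_features=24, random_state=3),
}

for window, (X, y) in windows.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        print(f"{window} {name}: weighted F1 = {f1_score(y_te, pred, average='weighted'):.4f}, "
              f"Accuracy = {accuracy_score(y_te, pred):.4f}")
```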
Table 3. Feature importance scores.

Feature | Importance Score
Maximum Acceleration | 0.1864
Fixation Rate | 0.1731
Mean Saccadic Amplitude | 0.1651
Mean Velocity in Z-Direction | 0.0716
Total Fixation Duration | 0.0548
Maximum Pupil Diameter Difference | 0.0427
Maximum Saccadic Amplitude | 0.0383
Saccadic Rate | 0.0349
Mean Velocity | 0.0334
Mean Fixation Duration | 0.0330
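A sketch of how importance scores such as those in Table 3 (and the cumulative curve in Figure 9) can be read off a fitted gradient boosting model is given below; the feature names and data are placeholders, not the study's features.

```python
# Sketch: ranked and cumulative impurity-based importances from gradient boosting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = [f"feature_{i}" for i in range(24)]  # placeholder names
X, y = make_classification(n_samples=400, n_features=24, random_state=0)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
order = np.argsort(gb.feature_importances_)[::-1]

cumulative = 0.0
for rank, idx in enumerate(order[:10], start=1):
    cumulative += gb.feature_importances_[idx]
    print(f"{rank:2d}. {feature_names[idx]:<12} importance={gb.feature_importances_[idx]:.4f} "
          f"cumulative={cumulative:.4f}")
```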
Table 4. SHAP-based feature disparities across attention modes.

Feature | Bottom-Up | Top-Down | Difference
Maximum Acceleration | −0.0111 | 0.0111 | 0.0221
Mean Acceleration | 0.0069 | −0.0069 | −0.0138
Total Fixation Duration | −0.0054 | 0.0054 | 0.0109
Saccadic Rate | −0.0052 | 0.0052 | 0.0104
Fixation Rate | −0.0038 | 0.0038 | 0.0076
Mean Velocity in Z-Direction | −0.0030 | 0.0030 | 0.0061
Mean Acceleration in X-Direction | 0.0025 | −0.0025 | −0.0051
Transition Entropy | 0.0024 | −0.0024 | −0.0048
Stationary Entropy | −0.0021 | 0.0021 | 0.0042
Blink Rate | −0.0015 | 0.0015 | 0.0030
Maximum Velocity | 0.0001 | −0.0001 | −0.0002
Table 5. Top 5 predictive features for each attentional state.

Rank | Feature | Bottom-Up | Feature | Top-Down
1 | Mean Acceleration | 0.0068 | Maximum Acceleration | 0.0111
2 | Mean Saccadic Amplitude | 0.0068 | Total Fixation Duration | 0.0054
3 | Mean Acceleration in X-Direction | 0.0025 | Saccadic Rate | 0.0052
4 | Transition Entropy | 0.0024 | Fixation Rate | 0.0038
5 | Maximum Saccadic Amplitude | 0.0014 | Mean Velocity in Z-Direction | 0.0030
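Per-class SHAP summaries in the spirit of Tables 4 and 5 can be approximated with the shap package as sketched below. The synthetic data, the tree explainer, and the convention that label 1 denotes the top-down (goal-directed) class are assumptions made for illustration only; this is not the authors' analysis code.

```python
# Sketch: mean SHAP contribution of each feature within each attention-state class.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=24, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(gb)
shap_values = explainer.shap_values(X)  # (n_samples, n_features) log-odds contributions

# Average SHAP value of each feature within each ground-truth class.
mean_bottom_up = shap_values[y == 0].mean(axis=0)
mean_top_down = shap_values[y == 1].mean(axis=0)
difference = mean_top_down - mean_bottom_up

# Report the five features with the largest absolute between-class difference.
top = np.argsort(np.abs(difference))[::-1][:5]
for idx in top:
    print(f"feature_{idx}: bottom-up={mean_bottom_up[idx]:+.4f}, "
          f"top-down={mean_top_down[idx]:+.4f}, difference={difference[idx]:+.4f}")
```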
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
