Article

Analyzing Visual Attention in Virtual Crime Scene Investigations Using Eye-Tracking and VR: Insights for Cognitive Modeling

1 Department of Forensic Science, Central Police University, Taoyuan City 333322, Taiwan
2 Department of Criminal Investigation, Central Police University, Taoyuan City 333322, Taiwan
3 Electrical and Computer Engineering Department, Old Dominion University, Norfolk, VA 23529, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3265; https://doi.org/10.3390/electronics14163265
Submission received: 27 June 2025 / Revised: 11 August 2025 / Accepted: 15 August 2025 / Published: 17 August 2025
(This article belongs to the Special Issue Autonomous and Connected Vehicles)

Abstract

Understanding human perceptual strategies in high-stakes environments, such as crime scene investigations, is essential for developing cognitive models that reflect expert decision-making. This study presents an immersive experimental framework that utilizes virtual reality (VR) and eye-tracking technologies to capture and analyze visual attention during simulated forensic tasks. A 360° panoramic crime scene, constructed using the Nikon KeyMission 360 camera, was integrated into a VR system with HTC Vive and Tobii Pro eye-tracking components. A total of 46 undergraduate students aged 19 to 24 (23 from the National University of Singapore and 23 from the Central Police University in Taiwan) participated in the study, generating over 2.6 million gaze samples (IRB No. 23-095-B). The collected eye-tracking data were analyzed using statistical summarization, temporal alignment techniques (Earth Mover’s Distance and Needleman–Wunsch algorithms), and machine learning models, including K-means clustering, random forest regression, and support vector machines (SVMs). Clustering achieved a classification accuracy of 78.26%, revealing distinct visual behavior patterns across participant groups. Proficiency prediction models reached optimal performance with random forest regression (R² = 0.7034), highlighting scan-path variability and fixation regularity as key predictive features. These findings demonstrate that eye-tracking metrics—particularly sequence-alignment-based features—can effectively capture differences linked to both experiential training and cultural context. Beyond its immediate forensic relevance, the study contributes a structured methodology for encoding visual attention strategies into analyzable formats, offering valuable insights for cognitive modeling, training systems, and human-centered design in future perceptual intelligence applications. Furthermore, our work advances the development of autonomous vehicles by modeling how humans visually interpret complex and potentially hazardous environments. By examining expert and novice gaze patterns during simulated forensic investigations, we provide insights that can inform the design of autonomous systems required to make rapid, safety-critical decisions in similarly unstructured settings. The extraction of human-like visual attention strategies not only enhances scene understanding, anomaly detection, and risk assessment in autonomous driving scenarios, but also supports accelerated learning of response patterns for rare, dangerous, or otherwise exceptional conditions—enabling autonomous driving systems to better anticipate and manage unexpected real-world challenges.

1. Introduction

Because records of how emergency cases such as crime scenes, fire scenes, major surgeries, or operational zones are handled are incomplete, developing specialized robots for these settings is very difficult. To train robots to observe and respond quickly in emergency environments, it is essential to understand how human experts rapidly observe and analyze such environments. This research therefore designs an analysis process that tests human experts and novices in immersive environments, examining differences in where they focus their observation and analysis during the task. Capturing this expertise will make it possible to train robots quickly in the future. We take crime scene investigation as an example; if successful, the same model can be applied to other emergency or special environments to automatically acquire human learning experience, thereby supporting education, machine learning, and robot development.
An HTC Vive virtual reality (VR) headset was used to present the simulated crime scene, and a Tobii Pro eye-tracking device was used to capture where the experimental participants looked and focused. Forty-six students from the National University of Singapore in Singapore and the Central Police University in Taiwan participated in our simulation experiment. SVM classification was then applied to analyze the participants’ eye-tracking results. Once results from participants in different regions and with different proficiency levels are obtained, they can be used to adapt robot learning.
Specialized robots are critical in managing emergencies such as crime scenes, fires, surgeries, or operational zones, where quick and precise actions are needed. However, developing these robots is challenging because detailed records are rarely kept in such high-pressure environments. Emergencies are unpredictable and vary widely, which makes it hard to gather the comprehensive data needed to train robots to handle these scenarios effectively. The shortage of detailed data on how experts handle emergencies creates a significant gap in the methods used to train robots: training typically relies on large amounts of data from well-documented scenarios, but in emergencies the missing details about expert actions make it difficult to teach robots the subtle skills they need to perform well under stress.
This research aims to fill this gap by drawing on the skills of human professionals who regularly handle emergencies. The educational and certification requirements for these crime scene professionals vary widely across jurisdictions [1]. By studying how such experts observe and react in simulated high-stress environments, we hope to gain important insights into how they make decisions and understand situations, and to store these observations as a substantial database of experience. The purpose of this work is to convert the knowledge gained from human experts into data that can be used to train robots more effectively. We use virtual reality (VR) [2] to recreate the crime scene and eye-tracking technology [3] to observe and record how participants look around and focus during the simulation. Virtual reality is a cutting-edge technology that uses computers to generate realistic objects and scenes that immerse users in their surroundings, typically perceived through a VR headset or helmet. Eye-tracking technology monitors a user’s eye movements and point of focus: it measures an individual’s point of gaze or eye position, collects the individual’s eye characteristics, and records them as data, providing summary statistics such as the number of gazes, the first gaze, and gaze duration.
To better situate our study within the broader research landscape, we review the relevant literature across four interrelated domains: human–robot interaction, robot training, eye-tracking technologies, and machine learning applications in visual behavior modeling. Furthermore, our work contributes to the development of autonomous vehicles by modeling how humans visually interpret complex and potentially hazardous environments. Understanding expert and novice gaze patterns during simulated forensic investigations can inform the design of autonomous systems that must make rapid, safety-critical decisions in similarly unstructured settings. The extraction of human-like visual attention strategies has direct implications for improving scene understanding, anomaly detection, and risk assessment in autonomous driving scenarios.

1.1. Human-Robot Interaction

Human–robot interaction (HRI) research spans a wide range of methodologies and focus areas. Some studies have emphasized psycho-physiological measures [4], while others address broader perception systems [5], single-platform analysis [6], or specific behavioral modeling [7]. Within computer vision, major progress has been made in gesture-based interaction [8], multimodal perception [9], hand gesture recognition [10,11,12], and human motion capture [13]. These contributions help build intuitive and context-aware robot systems capable of adapting to subtle human cues. Vision-based methods for robotics are also evolving rapidly. Surveys highlight progress in robotic vision systems [14,15], learning-based perception [16], and aerial vision systems [17]. Broader applications of robotic vision include object recognition, scene reconstruction, manipulation, and autonomous navigation [1,18,19], all of which rely on fine-grained visual understanding. Recent studies further highlight emerging directions in HRI: LLM-guided robot systems now allow for intuitive behavior control through language and examples [20]; cross-cultural factors such as trust, privacy, and anthropomorphism shape consumer acceptance of social home robots [21]; and generative AI models like ChatGPT-4 raise new ethical and cognitive challenges in anthropomorphic HRI design [22].

1.2. Robot Training

Robot training approaches have advanced from supervised pre-training to self-supervised and end-to-end methods. Semantic pretraining for downstream tasks is demonstrated in works such as those of Lin et al. [23] and Shridhar et al. [24], who leverage CLIP models [25] for language-conditioned imitation. Nair et al. [26] combine time-contrastive and language-based learning, while self-supervised methods focus on dynamics [27], interaction-driven visual representation [28], grasping [29,30], and deep visual encoders [31]. Temporal representations and video analysis also play key roles [32,33], as do visual correspondences [34], retrieval-based learning [35], and learning through demonstration [36]. End-to-end control pipelines [37,38,39] replace traditional components such as pose estimation [40], grasp planning [41], and motion planning [42]. Recent works bridge pipelined and end-to-end learning to improve sample efficiency [43,44,45]. Moreover, explainable AI (XAI) methods [46,47,48,49,50] enable robots to learn efficiently without requiring repeated or explicit instruction, enhancing human trust and usability.

1.3. Eye-Tracking Technology and Applications

Eye-tracking enables quantitative analysis of human attention and decision-making. It has been applied in medicine [51,52,53,54], sports training, and education to identify proficiency and cognitive load. Gaze-based comparisons of experts vs. novices, such as in pathology or surgical tasks [55,56], reveal distinct visual strategies that can inform training. The integration of eye-tracking with VR has proven especially valuable for simulation-based learning [57,58]. Forensic applications include document inspection [59], latent print analysis [60], and spatial reasoning in crime scenes [57,61,62]. In addition, sensor-level precision remains critical in ensuring the fidelity of eye-tracking and head-orientation measurements. Recent improvements in MEMS gyroscope modeling and compensation strategies have shown promising results in reducing motion-related drift and noise [63]. These studies emphasize that gaze behavior reflects not only experience but also situational awareness and real-time prioritization under pressure.

1.4. Machine Learning Applications in Eye-Tracking Research

Machine learning methods have become central in extracting meaningful features from eye-tracking data. Ahmed et al. [64] used neural networks to diagnose autism via gaze patterns. Gaze heatmaps have been used to classify abnormalities [65], predict radiologist errors [66,67], and assess fatigue-related variability [68]. Other studies explore gaze-guided histopathology annotation [69], surgery skill evaluation [70,71], error prediction [72], and gaze-based CNN training for segmentation tasks [73,74]. These diverse applications underscore the potential of gaze data as a high-dimensional, behaviorally grounded input for both interpretive modeling and predictive analytics. In [75], the authors propose a systematic review that summarizes how ML models classify eye movements and predict attention and reading behavior across various cognitive tasks. In [76], the researchers show that a VR Eye-tracking Cognitive Assessment (VECA) tool demonstrates high accuracy in detecting mild cognitive impairment using support vector regression on gaze features.

1.5. Research Objectives and Hypotheses

To address the challenges of modeling perceptual expertise in high-stakes environments, this study investigates how eye-tracking metrics—particularly temporal alignment features—can reveal differences in visual strategies across participants with varying backgrounds and levels of expertise.
We propose the following testable hypotheses:
Hypothesis 1.
Eye-tracking temporal alignment features (e.g., scan-path similarity measures) can reliably distinguish between novice and expert participants engaged in simulated crime scene analysis, with participants’ perceptual proficiency serving as a key indicator of expertise.
Hypothesis 2.
Differences in visual scanning strategies may arise from a combination of cultural/geographic background and domain-specific training or institutional affiliation. These differences can be quantitatively captured using temporal-sequence gaze metrics.
These hypotheses serve as the foundation for our machine learning analyses and inform the evaluation of group-level differences in perceptual behavior. We note that while the regional distinction (Singapore vs. Taiwan) serves as a primary grouping factor, participants also differ in proficiency, which may contribute to the observed behavioral differences.
This study recruited forty-six participants, undergraduate students aged 19–24 from the National University of Singapore in Singapore and the Central Police University in Taiwan, who took part in a simulated crime scene investigation while their eye movements and fixations were recorded. We applied multiple machine learning techniques, including K-means clustering for region differentiation, SVM classification for skill-level prediction, and random forest regression for proficiency modeling, to reveal how gaze behavior varies by region and expertise. Our main contributions are as follows: (1) we show how specific fixation patterns and fixation statistics predict the performance of novices and experts; (2) we demonstrate that eye-tracking metrics and alignment-distance features can reliably distinguish between Singaporean and Taiwanese participants based on their demonstrated professional performance; and (3) we provide a data-driven framework for extracting the visual strategies of experts. We expect this approach to improve the effectiveness of robot training, significantly improving robots’ ability to respond in emergencies, and to benefit education and robotics development by providing a model for quickly and effectively training robots for a variety of dangerous situations.

2. Methods

In this section, we describe the methodological framework used to investigate perceptual and behavioral differences during immersive forensic scene analysis. The methodology encompasses the construction of a virtual crime scene environment, participant recruitment, eye-tracking data collection, and a comprehensive pipeline for data preprocessing and feature extraction. In addition, we outline the machine learning models applied to identify regional and proficiency-based variations, along with the evaluation metrics used to assess model performance. This multi-stage process ensures both the reliability of gaze-based metrics and the interpretability of the resulting models. A summary of the overall methodology is illustrated in Figure 1, which outlines the key stages from experiment setup and data collection to model evaluation.

2.1. Simulated Crime Scene Setup and Data Collection

In this study, we employed the Nikon KeyMission 360 panoramic camera to capture images of a simulated crime scene (an example is shown in Figure 2). The resulting 360° photographs were then integrated into an immersive virtual reality (VR) environment using the Tobii Pro VR Integration system, which is based on the HTC Vive platform and equipped with eye-tracking capabilities.
The primary objective was to evaluate whether students from two distinct regions demonstrate significant differences in their analytical approaches to crime scene interpretation from eye-tracking information. Furthermore, the study sought to identify specific patterns of variation between experienced and inexperienced participants. These insights are intended to inform the development of future machine learning models aimed at enhancing crime scene analysis through behavioral and perceptual data.

2.2. Dataset Description and Preprocessing

2.2.1. Source of the Dataset

The complete dataset comprises eye-tracking recordings acquired from two distinct academic institutions. The first 23 participants (datasets T01 through T23) were enrolled at the National University of Singapore, while the remaining 23 participants (datasets T24 through T46) were affiliated with Central Police University (Taiwan IRB no. 23-095-B). Each participant performed a standardized set of visual tasks under controlled laboratory conditions. Raw eye-tracking logs for every individual were stored in MATLAB (R2023b)-compatible format (.mat files), capturing gaze event durations, fixation coordinates (X, Y), and eye-movement classifications as generated by the eye-tracker’s native software.

2.2.2. Composition of the Dataset

There are a total of 46 unique datasets, which are labeled sequentially from T01 to T46. Each dataset contains the following:
Dataset name (TXX): A unique identifier assigned to each participant.
Number of records: The total count of recorded eye-tracking samples for that participant. Across all subjects, individual record counts range roughly from 19,111 to 69,839 samples. For example, T03 comprises 19,111 entries, whereas T46 contains 69,839 entries.
Participant identifier: A coded label (e.g., “04 Shermaine”, “07 Phoebe”, and “CPU01 XinBang”) that preserves anonymity while allowing cross-referencing with proficiency scores.
Proficiency score: A decimal value between 0.38 and 0.72 represents each participant’s task-specific skill level obtained from standardized proficiency assessments before eye-tracking. Note that dataset T04 (participant “14 Tan”) lacks a recorded proficiency value (NaN) and was subsequently excluded from proficiency-based analyses.
Affiliated institution: This indicates whether the participant belonged to the National University of Singapore (datasets T01–T23) or Central Police University (datasets T24–T46).
Each raw dataset contains multiple data fields exported directly from the eye-tracker, including the following: Gaze Event Duration, the duration (in milliseconds) of each fixation event; Fixation Point X and Y Coordinates, two spatial channels denoting horizontal and vertical eye-position in screen coordinates; and Eye Movement Type, a categorical label (e.g., “fixation”, “saccade”, and “blink”) representing the eye-movement classification assigned by the tracker’s internal algorithm.
Collectively, these 46 datasets encompass a total of approximately 2.6 million individual gaze samples, balanced evenly between Singaporean and Taiwanese cohorts.
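As a concrete illustration, the sketch below shows one way such an exported .mat log could be loaded into a tabular structure for analysis. The variable names (GazeEventDuration, FixationPointX, FixationPointY, EyeMovementType) are assumptions based on the fields described above and may differ from the actual names inside the exported files.

```python
# Minimal loading sketch (assumed field names) for one participant's
# eye-tracking log exported in MATLAB-compatible .mat format.
import numpy as np
import pandas as pd
from scipy.io import loadmat

def load_participant(path):
    """Load one .mat eye-tracking log and return a tidy DataFrame."""
    raw = loadmat(path, squeeze_me=True)  # e.g., "T01.mat" (hypothetical file name)
    return pd.DataFrame({
        "GazeEventDuration": np.asarray(raw["GazeEventDuration"], dtype=float),  # ms
        "FixationX": np.asarray(raw["FixationPointX"], dtype=float),
        "FixationY": np.asarray(raw["FixationPointY"], dtype=float),
        "EyeMovementType": np.asarray(raw["EyeMovementType"]).astype(str),
    })

# Example usage:
# t01 = load_participant("T01.mat")
# print(len(t01), "gaze samples")
```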

2.2.3. Preprocessing Steps and Data Cleaning

Before feature extraction and downstream analysis, all raw eye-tracking data underwent a standardized preprocessing pipeline, as follows:
Key parameters extraction: Each participant’s .mat file was imported into a consistent MATLAB-compatible structure. Then, the core columns were extracted for subsequent processing—for example, gaze event duration, fixation X-coordinate, fixation Y-coordinate, or eye-movement type.
Handling missing or invalid entries: Any rows in which fixation coordinates (X or Y) were NaN or fell outside the display’s valid range were excluded, as NaN values cause errors or incorrect results when training machine learning models. We then removed samples with non-physiological gaze durations (e.g., durations <50 ms or >2000 ms) to mitigate spurious fixations or recording artifacts. According to [3], fixations shorter than 50 ms are unlikely to reflect meaningful cognitive processing and are often the result of signal noise or microsaccades. Similarly, durations longer than 2000 ms are rare and may reflect tracker loss, blinks, or other non-perceptual pauses. Including such extreme values may distort statistical distributions and affect the validity of gaze-based inferences. Because dataset T04 (participant “14 Tan”) lacked a proficiency score, it was flagged and omitted from analyses that rely on proficiency as an independent variable.
Derivation of composite features: The Euclidean distance traveled between consecutive fixations was computed as
D_i = \sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}
for each valid pair of fixation points (x_{i-1}, y_{i-1}) and (x_i, y_i). This “fixation-distance” metric serves as a proxy for scan-path amplitude (a short code sketch of this step appears at the end of this subsection).
Normalization and label standardization: All proficiency scores were converted to a uniform decimal format; where units did not match, a pre-conversion step was applied. We also standardized participant identifiers, removed extraneous whitespace, and ensured consistent space-delimited formatting (e.g., “CPU01 XinBang” rather than “CPU01XinBang”, matching identifiers such as “04 Shermaine”). To ensure consistent parameter label names when training the models, we harmonized eye-movement type labels across both institutions (e.g., mapping “Fix”, “F”, or “fixation” uniformly to “Fixation”).
Data format consistency checks: We verified that each subject’s record count matched the value reported in the “Number of Records” column of the dataset. All timestamps (where present in the raw logs) were converted to a single time base (milliseconds from task onset). We then confirmed that all fields were exported in numeric format (double precision for coordinates and durations; integer or categorical encoding for eye-movement types) so that downstream processing would run without errors.
Following these procedures, the resulting clean datasets were ready for exploratory analysis, feature-level clustering, and proficiency-based modeling.
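As referenced above, the following sketch summarizes the cleaning rules (dropping missing coordinates, filtering non-physiological durations, harmonizing labels) and the fixation-distance computation, assuming the tabular layout from the loading sketch in Section 2.2.2; column names and thresholds mirror the description but are illustrative.

```python
# Preprocessing sketch for one participant's gaze table.
import numpy as np
import pandas as pd

def preprocess(df, min_dur_ms=50, max_dur_ms=2000):
    # Drop rows with missing fixation coordinates.
    df = df.dropna(subset=["FixationX", "FixationY"]).copy()
    # Remove non-physiological fixation durations (<50 ms or >2000 ms).
    df = df[df["GazeEventDuration"].between(min_dur_ms, max_dur_ms)]
    # Harmonize eye-movement labels across institutions.
    df["EyeMovementType"] = df["EyeMovementType"].replace(
        {"Fix": "Fixation", "F": "Fixation", "fixation": "Fixation"}
    )
    # Euclidean distance between consecutive fixations (scan-path amplitude proxy).
    dx = df["FixationX"].diff()
    dy = df["FixationY"].diff()
    df["EuclideanDistance"] = np.sqrt(dx**2 + dy**2).fillna(0.0)
    return df
```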

2.3. Analytical Framework

We conducted two main lines of analysis: regional difference modeling and proficiency-level prediction. For both, we applied a variety of machine learning models, including unsupervised clustering, classification, and regression approaches.

2.3.1. Regional Differences Analysis

To investigate whether eye-tracking patterns differ systematically between participants in Singapore (datasets T01–T23) and Taiwan (datasets T24–T46), we first extracted summary statistics from two sets of features. For each participant, we computed four statistics (maximum, mean, minimum, and standard deviation) for both GazeEventDuration and EuclideanDistance (derived from Fixation X and Fixation Y). These eight resulting values formed the “gaze-duration/scan-path” feature vector.
Next, we applied two alignment-based distance measures, Earth Mover’s Distance (EMD) [77] and Needleman–Wunsch (N-W) [78], to all pairs of participants within each region. Each algorithm produced a 23 × 23 inter-participant distance matrix (one per region), which we then converted into participant-level features by computing the same four summary statistics (max, mean, min, STD) for each 23 × 23 matrix row. After performing this conversion for both EMD and N-W matrices, we obtained eight additional features (four from EMD, four from N-W). By concatenating these eight “alignment-distance” features with the original eight “gaze/scan” features, each participant was represented by a single 16-dimensional feature vector (total size: 46 participants × 16 features).
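A simplified sketch of this feature construction is given below. It uses SciPy’s 1-D Wasserstein distance as a stand-in for the Earth Mover’s Distance between two participants’ gaze-duration sequences; the exact quantities aligned in the study and the Needleman–Wunsch scoring scheme are not reproduced here, so the N-W row statistics are left as placeholders.

```python
# Per-participant 16-dimensional feature vector for one region (sketch).
import numpy as np
from scipy.stats import wasserstein_distance

def summary_stats(x):
    x = np.asarray(x, dtype=float)
    return [x.max(), x.mean(), x.min(), x.std()]

def region_feature_matrix(participants):
    """participants: list of dicts with 'dur' and 'dist' arrays, one per person."""
    n = len(participants)
    emd = np.zeros((n, n))                       # pairwise EMD matrix (e.g., 23 x 23)
    for i in range(n):
        for j in range(n):
            emd[i, j] = wasserstein_distance(participants[i]["dur"],
                                             participants[j]["dur"])
    rows = []
    for i, p in enumerate(participants):
        rows.append(summary_stats(p["dur"])      # 4 gaze-duration statistics
                    + summary_stats(p["dist"])   # 4 scan-path statistics
                    + summary_stats(emd[i])      # 4 EMD-row statistics
                    + [np.nan] * 4)              # 4 N-W row statistics (computed
                                                 # analogously; omitted here)
    return np.array(rows)                        # shape: (n, 16)
```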

2.3.2. Proficiency Prediction

In this study, “proficiency” refers to each participant’s demonstrated task skill level, quantified via a standardized assessment that yields a decimal score between 0 and 1. Within eye-tracking research, proficiency serves as the primary outcome variable: by correlating gaze-based metrics (e.g., fixation durations, saccadic amplitudes) with proficiency scores, one can infer how visual attention patterns reflect underlying task expertise. A higher proficiency score implies more efficient or targeted gaze behaviors, such as shorter fixation variability or optimized scan-path distances. In contrast, lower scores often coincide with irregular or less focused eye movement patterns. Accordingly, proficiency not only validates the ecological relevance of eye-tracking features but also guides feature selection when constructing predictive models.

2.3.3. Evaluation Metrics

Model performance was evaluated using standard regression metrics, including Mean Squared Error (MSE) and the Coefficient of Determination (R²), computed on a held-out test set to assess generalization accuracy. In addition to the final test results, we employed 10-fold cross-validation during the training and validation phases to obtain more robust performance estimates and mitigate the impact of sample variability. This approach ensured that the models were not overfitting to specific subsets of the data and allowed for a more reliable comparison across different algorithms.
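A minimal sketch of this evaluation protocol is shown below, assuming X holds the engineered gaze features and y the proficiency scores; the regressor and split sizes are illustrative.

```python
# 10-fold cross-validation plus held-out test evaluation with MSE and R^2.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_validate, train_test_split

def evaluate(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(random_state=0)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(model, X_tr, y_tr, cv=cv,
                            scoring=("neg_mean_squared_error", "r2"))
    print("CV MSE:", -scores["test_neg_mean_squared_error"].mean())
    print("CV R2 :", scores["test_r2"].mean())
    model.fit(X_tr, y_tr)                  # final fit on the training split
    pred = model.predict(X_te)             # held-out test performance
    print("Test MSE:", mean_squared_error(y_te, pred))
    print("Test R2 :", r2_score(y_te, pred))
```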
Furthermore, we conducted a feature importance analysis to identify which gaze-related variables contributed most significantly to proficiency prediction. By leveraging model-specific importance rankings (e.g., from Random Forests) and cross-model consistency, we evaluated the relative predictive power of features such as fixation duration variability, scan-path entropy, and gaze dispersion. These analyses offer valuable insight into the perceptual cues that distinguish expert and novice behavior within immersive forensic tasks.
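The importance ranking itself can be obtained from a fitted random forest, as in the sketch below; the feature names are assumed to match the columns of X.

```python
# Impurity-based feature-importance ranking from a random forest (sketch).
from sklearn.ensemble import RandomForestRegressor

def rank_features(X, y, feature_names):
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    ranked = sorted(zip(feature_names, rf.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, importance in ranked:
        print(f"{name:35s} {importance:.3f}")
    return ranked
```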

3. Results

This section presents the key findings derived from the analysis methods described previously. We report results from two primary analytical perspectives: (1) regional variation in gaze behavior between participants from different geographic and institutional backgrounds, and (2) proficiency-based differences in visual attention and cognitive strategies. The outcomes of clustering, classification, and regression models are discussed alongside relevant evaluation metrics to assess the effectiveness of eye-tracking data in modeling human perceptual performance.

3.1. Regional Difference Analysis Results

To examine potential regional differences in visual behavior, we first applied unsupervised clustering methods to explore whether gaze-based features naturally differentiate participants from the two cohorts. This is followed by supervised classification to validate and interpret key features contributing to group separation. The results from each modeling stage are presented below.

3.1.1. Clustering Outcome and Group Separation

We then applied K-means clustering (k = 2) to the 46 × 16 feature matrix to recover two clusters that ideally correspond to the two geographic cohorts. After assigning each participant to one of the two clusters (labeled “1” or “2” by the algorithm), we compared cluster membership against the true region label (1 = Singapore, 2 = Taiwan). The resulting confusion matrix is shown in Table 1, where the rows indicate the true region and the columns indicate the predicted cluster:
From this matrix, we observe that 14 out of 23 Singapore participants were correctly clustered (cluster 1), while 9 were misassigned to cluster 2. Similarly, 22 out of 23 Taiwan participants were correctly placed in cluster 2, with only a single participant misclustered. In total, the clustering algorithm achieved an overall accuracy of
(14 + 22) / 46 = 0.7826
An accuracy of 78.26% indicates that our 16-feature representation (combining raw summary statistics and alignment-distance statistics) captures substantial regional differences in eye-tracking behavior. The high correct-classification rate for Taiwan (22/23) suggests that Taiwanese participants exhibit more consistently distinct patterns, whereas Singapore participants show somewhat greater intra-group variability, evidenced by nine misclassifications. These misclustered cases likely reflect individual differences in task strategy or data quality rather than a complete absence of regional effects.
Overall, this K-means clustering analysis supports the hypothesis that, when summarized through gaze-event/scan-path metrics and EMD/N-W distance features, eye-tracking data from Singapore and Taiwan tend to form two separable clusters. Future work could explore alternative clustering methods (e.g., hierarchical clustering or Gaussian mixture models) or incorporate additional temporal features to reduce the misclassification rate further.
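For reference, the clustering step can be sketched as follows, assuming a 46 × 16 feature matrix X and binary region labels; because K-means cluster indices are arbitrary, accuracy is computed over the better of the two possible label assignments.

```python
# K-means (k = 2) regional clustering sketch on standardized features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_regions(X, region_labels):
    X_std = StandardScaler().fit_transform(X)
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
    region_labels = np.asarray(region_labels)   # 0 = Singapore, 1 = Taiwan
    accuracy = max(np.mean(pred == region_labels),
                   np.mean((1 - pred) == region_labels))
    return pred, accuracy
```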

3.1.2. Feature Importance Analysis and Key Findings

To understand which features most strongly drive the separation between Singaporean (T01–T23) and Taiwanese (T24–T46) participants, we trained a random forest classifier on the 15-feature set (combining the raw gaze-duration and Euclidean-distance statistics with the EMD and N-W alignment distances). EuclideanDistance_min was removed because it was zero for all participants. Figure 3 displays the resulting feature importances, ranked from highest to lowest.
EMD_std (standard deviation of Earth Mover’s Distances) is the most critical feature, accounting for roughly 38% of the total importance. This indicates that variability in pairwise alignment costs (i.e., how uniformly or unevenly participants’ eye-tracking sequences resemble others within their region) provides the strongest regional signal. EMD_min (minimum Earth Mover’s Distance to any same-region counterpart) is the second-most important feature (around 20%). A lower EMD_min suggests that a participant’s scan pattern closely matches that of at least one other participant from the same region, reinforcing that intraregional eye-movement similarity is a key discriminator. EMD_max (maximum EMD) also contributes substantially (around 8%), implying that the furthest alignment gap (i.e., the largest dissimilarity to another same-region participant) still carries regional information. Among the Needleman–Wunsch features, NW_std (standard deviation of N-W alignment scores) emerges next (around 7%), followed by NW_min (around 3%) and NW_mean (around 2%). The maximum N-W distance (NW_max) is minimally informative (<1%). The EMD statistics collectively account for nearly two-thirds of the total importance. The comparatively high ranking of NW_std underlines that both alignment-based measures capture complementary nonlinear temporal patterns in how participants scan visual stimuli.
GazeEventDuration_std (the standard deviation of fixation durations) has approximately 5% importance. A higher GazeEventDuration_std suggests more variable fixation lengths, which may reflect differences in attentional strategy between regions. GazeEventDuration_max, GazeEventDuration_mean, and GazeEventDuration_min each contribute roughly 2–3%. These raw fixation metrics play a secondary role: they reinforce that, although absolute fixation lengths differ slightly by region, they are not as discriminative as sequence-alignment distances. The remaining features have minimal impact, and their influence on the classification is insignificant.

3.2. Proficiency Prediction Results

Beyond regional group analysis, we also investigated how gaze behavior reflects participants’ proficiency levels during the simulated investigation task. Using labeled proficiency scores, we trained a set of regression models to predict individual performance based on key eye-tracking features. The following subsection compares the performance of these models and identifies which learning approach yields the most accurate predictions.

3.2.1. Machine Learning Models Applied for Proficiency Prediction

For each sample in the dataset, we calculated the maximum, mean, minimum, and standard deviation (STD) of three primary metrics, GazeEventDuration, EuclideanDistance, and EyeMovementTypeIndex, and used these statistics as candidate features. However, because the minimum value of EuclideanDistance is always zero and the minimum value of EyeMovementTypeIndex is always one, we excluded those two minimum parameters from model training. Consequently, this step yielded ten distinct features (four statistics × three metrics minus two trivial minima).
In addition, the EyeMovementType variable comprises four categories: Saccade, Unclassified, Fixation, and EyesNotFound. For each sample, we counted the total occurrences of each category and included these four counts as additional features. As a result, the model initially employed fourteen parameters (ten from the summary statistics plus four category counts). Given that our sample size is relatively small (only 46 observations), we further augmented the feature set by introducing the squared values of these fourteen parameters. By doing so, we expanded the feature space to twenty-eight parameters, thereby providing the machine-learning models with additional nonlinear information to improve predictive performance. To predict continuous proficiency scores from gaze metrics, we first applied Progressive Regression (i.e., stepwise linear regression) to identify the most informative subset of features. The indices selected by progressive regression were: [0, 1, 2, 5, 6, 13, 18, 20, 21, 22, 23, 24, 25, 27], which correspond to the following derived features: GazeEventDuration_mean, GazeEventDuration_std, GazeEventDuration_max, EuclideanDistance_std, EuclideanDistance_max, EyeMovementType_EyesNotFound, EuclideanDistance_mean2, EuclideanDistance_max2, EyeMovementTypeIndex_mean2, EyeMovementTypeIndex_std2, EyeMovementTypeIndex_max2, EyeMovementType_Saccade2, EyeMovementType_Unclassified2, EyeMovementType_EyesNotFound2.
Using this reduced feature set, we trained and evaluated five distinct algorithms: Linear Regression, Polynomial Regression (degree 2), Random Forest, Decision Tree, and Support Vector Regression (SVR, RBF kernel). More experimental details are presented in Section 3.2.2.
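A compact sketch of this model comparison is shown below; hyperparameters are illustrative defaults rather than the exact settings used in the study.

```python
# Comparison of the five regressors on a held-out split (sketch).
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

def compare_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {
        "Linear": LinearRegression(),
        "Polynomial (deg 2)": make_pipeline(PolynomialFeatures(2), LinearRegression()),
        "Random Forest": RandomForestRegressor(random_state=0),
        "Decision Tree": DecisionTreeRegressor(random_state=0),
        "SVR (RBF)": SVR(kernel="rbf"),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        print(f"{name}: MSE={mean_squared_error(y_te, pred):.4f}, "
              f"R2={r2_score(y_te, pred):.4f}")
```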

3.2.2. Evaluation Metrics and Performance Comparison

All models were assessed via Mean Squared Error (MSE) and Coefficient of Determination (R²) on a held-out test set, and the experimental results are shown in Table 2.
Based on the results in Table 2, we find that the coefficients obtained after training the linear regression model are [0.077, −0.098, 0.066, 0.013, −0.844, 0.029, −0.005, 0.828, −0.149, 0.233, −0.043, 0.032, −0.043, −0.018], and the intercept is 0.5810. After feature selection, the residuals (Figure 4) exhibit reduced heteroscedasticity. However, linearity assumptions limit explanatory power (low R²).
We present the results through residual plots and feature importance charts to further analyze the performance of different machine learning models and the importance of various features.
Figure 4 shows the distribution of residuals (residual = predicted value − actual value) for linear regression, polynomial regression, random forest, decision tree, and SVR after the introduction of progressive regression, with the actual value on the horizontal axis, the residuals on the vertical axis, and the red dashed line indicating the ideal baseline where the residuals are zero.
Linear regression: Residuals form a “funnel” pattern: when actual proficiency is low (around 0.40–0.45), residuals are predominantly negative; as actual values increase, residuals shift positive and exhibit a larger spread. This indicates systematic bias (heteroscedasticity): high-proficiency samples are under-predicted, and low-proficiency samples are slightly over-predicted.
Polynomial regression: Most residuals lie near zero, but a few extreme positive outliers reach nearly +0.80. This pattern suggests strong overfitting: the model fits the training data nearly perfectly (residual around 0) but fails to generalize for specific test points, resulting in significant outliers.
Random forest: Residuals are tightly clustered around zero, with nearly all points within ±0.05. This indicates that Random Forest captures nonlinear relationships well and minimizes systematic and random errors, making it the best-performing model.
Decision tree: Residuals are mostly within ±0.05, but there are some larger errors: negative residuals (around −0.17) when actual scores are low (0.40–0.45) and positive residuals (around +0.18) at mid-to-high proficiency (around 0.60). A single Decision Tree tends to overfit certain regions and underfit others, leading to localized spikes in error.
SVR: Residuals resemble those of Linear Regression: negative bias when scores are low and positive bias when scores are high. The spread is larger (up to ± 0.12 ). Although SVR can model some nonlinearities, with the given features and sample size, it still underpredicts high-proficiency samples and exhibits higher variance in residuals.
The random forest has the most concentrated and near-zero residuals, indicating that it is the best for fitting ability and generalization performance. Decision trees are the next best, but still have significant regional errors. The residuals of linear regression and SVR show a significantly skewed distribution, indicating that purely linear or kernel mapping is not a good enough fit for the feature set of this study. Polynomial regression showed extreme outlier residuals, tending to overfit the training set and generalize the worst.
In Figure 5, the horizontal axis shows the coefficient values. The vertical axis lists the 14 features selected by Progressive Regression, including the original metrics and their squared versions. Darker colors correspond to larger absolute coefficient values, while lighter colors indicate smaller absolute coefficients. EuclideanDistance_max2 (maximum Euclidean distance, squared) is the strongest positive feature, with a coefficient of +0.82, the largest positive value. This result suggests that a larger maximum fixation span (nonlinearly emphasized by squaring) strongly increases predicted proficiency, implying that more proficient users tend to exhibit wider visual scanning behavior. EyeMovementTypeIndex_std2 (standard deviation of the eye-movement-type index, squared) is the second-largest positive contributor, with a coefficient of +0.25. GazeEventDuration_mean (mean fixation duration) and GazeEventDuration_max (maximum fixation duration) had the third- and fourth-largest positive coefficients, at +0.07 and +0.03, respectively. From these results, we infer that variability in eye-movement patterns (when squared) and longer or more stable fixation durations are associated with higher proficiency, although their impact is smaller.
In addition, EuclideanDistance_std (standard deviation of Euclidean distance) has the smallest absolute coefficient value, at around −0.01. EuclideanDistance_mean2, EyeMovementTypeIndex_max2, EyeMovementType_Unclassified2, and EyeMovementType_EyesNotFound2 also have small negative coefficients (from −0.001 to −0.02). It is observed that larger path variability (EuclideanDistance_std) slightly reduces predicted proficiency. Certain squared features (e.g., large average distance, unclassified eye-movement types, or missing-eyes events) have minor negative effects when they become large.
The two dominant features are EuclideanDistance_max2 and EyeMovementTypeIndex_std2; they account for the majority of the model’s explanatory power. The remaining features have relatively small coefficients, indicating minor incremental contributions once primary features are accounted for.
In Figure 6, the horizontal axis shows the importance score (nonnegative) assigned by the Random Forest. The vertical axis ranks all features (original and squared), sorted from highest to lowest importance. Colors transition from deep blue (highest importance) through lighter shades to red (lowest importance), highlighting relative ranking. The top three features are EuclideanDistance_std2 (standard deviation of Euclidean distance, squared), with an importance of around 0.085 (highest); GazeEventDuration_mean (mean fixation duration), with an importance of around 0.083; and EuclideanDistance_std (standard deviation of Euclidean distance), with an importance of around 0.082. By contrast, in linear regression EuclideanDistance_std2 had only a small negative coefficient. In the random forest, variability in the scan path (nonlinearly emphasized via squaring) is the strongest discriminator between proficiency levels. The high importance of GazeEventDuration_mean also indicates that a stable average fixation length strongly correlates with higher proficiency. The third-ranked EuclideanDistance_std confirms that path variability (without squaring) remains highly informative, indicating that the random forest leverages both the linear and nonlinear forms of this metric.
Secondary important features include EyeMovementType_Fixation (proportion of time spent fixating, importance of around 0.062), GazeEventDuration_mean2 (squared mean fixation duration, importance of around 0.058), and EyeMovementType_Fixation2 (squared fixation proportion, importance of around 0.055). These features emphasize that higher proportions of gaze and the nonlinear effect of gaze duration help the model distinguish between proficiency levels. Other mid-tier contributors (between 0.045 and 0.050) indicate irregular fixation durations and unclassified/missing-eyes events provide additional nuance, enabling the model to capture subtle proficiency differences. For the lower but non-zero importance features, though they are less critical than the top features, these metrics still contribute when the stronger signals alone cannot fully partition complex samples.
Figure 7 shows the feature correlation heatmap, which reveals several strong positive correlations among the statistical summaries of the same metric. For example, GazeEventDuration_mean, GazeEventDuration_std, and GazeEventDuration_max exhibit coefficients above 0.85, indicating that these features broadly capture overlapping information. Similarly, EuclideanDistance_mean, EuclideanDistance_std, and EuclideanDistance_max are highly interrelated (correlations near 0.90), while their squared counterparts correlate almost perfectly with the original values (e.g., EuclideanDistance_std2 vs. EuclideanDistance_std around 0.99). In the case of EyeMovementTypeIndex, mean, std, and max also co-vary strongly (coefficients >0.95), which suggests redundancy among those features. By contrast, cross-feature correlations between different metrics (e.g., between GazeEventDuration_max and EuclideanDistance_std) are generally moderate (0.30–0.50) or low, indicating that combining gaze-duration and scan-path variability measures may still provide complementary information. Finally, negative correlations appear between specific eye-movement categories (for instance, EyeMovementType_Fixation vs. EyeMovementType_Saccade of coefficients around −0.50), reflecting the mutually exclusive nature of those classifications. This heatmap highlights potential multicollinearity among summary statistics of the same underlying measure, suggesting that dimensionality reduction or feature selection could improve model stability without sacrificing predictive power.
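A correlation check of this kind can be produced directly from the engineered feature table, as in the sketch below (assuming the features are held in a pandas DataFrame, one column per feature).

```python
# Feature-correlation heatmap sketch (cf. Figure 7).
import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature_correlations(feature_df):
    corr = feature_df.corr()                     # Pearson correlations
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, cmap="coolwarm", center=0.0, annot=False)
    plt.title("Correlation among gaze-derived features")
    plt.tight_layout()
    plt.show()
```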
To investigate whether eye-tracking metrics can predict participant proficiency, we trained Support Vector Machine (SVM) classifiers under two conditions: (1) using only the original features and (2) using the original features plus their squared terms. In each case, we applied 10-fold cross-validation, tuned the hyperparameters C ∈ {0.1, 1, 10}, γ ∈ {scale, auto}, and kernel ∈ {linear, poly, RBF}, and evaluated classification accuracy under both a 3-class scheme (proficiency bins: [0.3833, 0.50], (0.50, 0.64], (0.64, 1.0]) and a 4-class scheme (bins: [0.3833, 0.50], (0.50, 0.60], (0.60, 0.70], (0.70, 1.0]). The resulting classification performances under each configuration are summarized in Table 3.
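The sketch below outlines this protocol for the three-class scheme, assuming X holds the selected gaze features and proficiency the continuous scores; the grid and bin edges follow the description above, while the cross-validation splitter is an illustrative choice.

```python
# SVM grid search over C, gamma, and kernel with 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_svm(X, proficiency):
    # Three classes: <= 0.50, (0.50, 0.64], > 0.64.
    y = np.digitize(proficiency, bins=[0.50, 0.64], right=True)
    param_grid = {
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", "auto"],
        "svc__kernel": ["linear", "poly", "rbf"],
    }
    pipe = make_pipeline(StandardScaler(), SVC())
    search = GridSearchCV(pipe, param_grid,
                          cv=KFold(n_splits=10, shuffle=True, random_state=0),
                          scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, search.best_score_
```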
Table 4 presents the detailed 10-fold cross-validation results for the optimal SVM classifiers in both three-class and four-class proficiency classification. The scores indicate the classification accuracy achieved on each fold, reflecting the model’s generalizability. The average accuracy across the 10 folds was 0.7250 for the three-class setup and 0.5900 for the four-class setup.
When using only the original metrics, the best three-class result (72.50% accuracy) was achieved by the feature set [GazeEventDuration_mean, EyeMovementType_Saccade, EyeMovementType_Unclassified, EyeMovementType_Fixation] with C = 10, γ = ‘scale’, and kernel = ‘RBF’. With C = 10 and an RBF kernel, the SVM effectively separated participants into three discrete proficiency levels, achieving a relatively high accuracy. The fact that these four “unsquared” features suffice suggests that, for coarse ternary grouping, linear relationships between raw gaze metrics and proficiency already capture most of the discriminatory information. In the four-class setting, accuracy reached 55.50% with the quartet [GazeEventDuration_mean, EuclideanDistance_mean, EyeMovementType_Unclassified, EyeMovementType_EyesNotFound] under C = 10, γ = ‘auto’, and kernel = ‘RBF’. The reduced accuracy suggests that the original features alone struggle to distinguish among the four adjacent proficiency categories.
After augmenting the feature space with each metric’s squared term, the optimal three-class feature combination remained unchanged and yielded 72.50% accuracy (same hyperparameters). In other words, the added squared terms do not improve or alter the model’s performance when dividing the dataset into three broad proficiency levels. However, adding squared terms improved overall accuracy to 59.00% for four-class classification. The new best four-feature set became [EuclideanDistance_std, EyeMovementType_EyesNotFound, EuclideanDistance_std2, EyeMovementTypeIndex_mean2] with C = 1, γ = ‘auto’, and kernel = ‘RBF’. These results indicate that while squared terms do not benefit the three-class task, they provide additional discriminative power in the more fine-grained four-class scenario. Note also that the optimal regularization parameter falls from C = 10 to C = 1, suggesting that including squared features reduces overfitting and allows the SVM to operate with stronger regularization.

4. Conclusions

In response to the growing need for the efficient transfer of human expertise in high-stakes settings, we developed an immersive VR framework that combines 360° panoramic crime scenes on the HTC Vive platform with Tobii Pro eye-tracking technology to capture and analyze visual attention in simulated forensic investigations. Approximately 2.6 million gaze samples were generated by 46 participants (23 from Singapore and 23 from Taiwan), which were processed and analyzed using summary statistics, cluster analysis, temporal alignment, and five machine learning models.

4.1. Statistical Result

In this study, we first applied K-means clustering to distinguish eye-tracking data from Singaporean (T01–T23) and Taiwanese (T24–T46) participants. Combining eight summary statistics of gaze events and Euclidean distances (max, mean, min, and STD for each) with eight alignment-distance features derived from EMD and N-W matrices, we obtained a 16-dimensional representation for each participant. K-means (k = 2) then achieved an overall clustering accuracy of 78.26% (14/23 Singapore correct, 22/23 Taiwan correct). These results support the hypothesis that participants from different regions exhibit distinct patterns in these feature parameters.
Next, we evaluated several regression models to predict individual proficiency scores (continuous values between 0 and 1). Random forest produced the tightest, most symmetric residuals (R² = 0.7034), demonstrating the strongest generalization and capturing nonlinear interactions effectively. The decision tree performed reasonably well (R² = 0.6379) but exhibited localized pockets of overfitting and underfitting. Linear regression (with progressive regression feature selection) and SVR (RBF) both showed systematic bias and wider error spread (linear R² = 0.1757; SVR R² = 0.2909), indicating that purely linear or simple kernel approaches cannot fully capture the complex, nonlinear relationships in the data. Polynomial regression (degree 2) achieved near-zero residuals on many points but suffered from extreme outliers (R² = −1.0068), clear evidence of severe overfitting. For the linear regression model (after progressive regression), EuclideanDistance_max2 and EyeMovementTypeIndex_std2 emerged as the two dominant positive predictors.

4.2. Implications of the Results

The fact that EMD and N-W-based alignment distances dominate feature importance (together accounting for nearly two-thirds of total importance) indicates that temporal alignment of gaze sequences is a far richer signal of regional difference than raw spatial statistics alone. In other words, participants within the same region tend to exhibit more similar scan-path timing and order, which K-means can exploit more effectively than Euclidean displacement or fixation durations on their own. The strong performance of EMD_std and EMD_min (together explaining nearly 60% of the regional classification power) underscores that alignment-based distance measures effectively encode temporal scan-path patterns, patterns that would be difficult to detect using only spatial summaries. This also explains why K-means clustering, which can leverage these alignment distances, accurately separated Singapore and Taiwan cohorts.
Linear regression with progressive regression revealed that “maximum scan-path squared” and “eye-movement-type-variability squared” are the primary linear predictors of proficiency. At the same time, negative coefficients for path variability (EuclideanDistance_std) suggest that erratic scanning patterns mildly detract from proficiency. However, because most coefficients were near zero, the linear/quadratic model captured only a fraction of the underlying complexity—hence its residual bias and low predictive power. Random forest, on the other hand, prioritized path variability (EuclideanDistance_std and EuclideanDistance_std2) and average fixation duration. Unlike Linear Regression (which emphasizes a single extreme value), Random Forest focuses on overall variability rather than simple maxima. Additional features, like fixation-type proportions and fixation-duration variability, help the model to discriminate finer proficiency differences. Consequently, Random Forest’s ability to combine original and squared features allows it to capture linear and non-linear relationships, leading to substantially higher R².

4.3. Limitations and Potential Improvements

With only 46 participants, our sample is relatively small. Small sample sizes can lead to overfitting or unstable feature importance rankings, so future studies should expand to larger cohorts to enhance generalizability. Missing or noisy fixation coordinates (NaN values) reduce the effective sample available for Euclidean-distance calculations. While we dropped NaNs and computed statistics on the remaining valid rows, any systematic errors in gaze capture (e.g., loss of tracking) could bias our distance measures.
K-means assumes spherical clusters of roughly equal size in Euclidean space, which may not perfectly fit the complex distributions of gaze and alignment features. Participants who lie near cluster boundaries or exhibit atypical scanning may be subject to misclustering. Other methods, such as Gaussian Mixture Models (GMM), might reduce misclassification rates.
Our approach reduces each participant’s scan path to summary statistics (GazeEventDuration and Euclidean distances) and alignment distances. This discards fine-grained temporal dynamics (e.g., the exact sequence of fixations over time). Sequence-based deep-learning models, such as LSTMs, could capture this temporal information more faithfully as feature representations.
Another important limitation concerns the interpretation of group-level differences. While we classified participants based on their institutional affiliation (i.e., National University of Singapore in Singapore vs. Central Police University in Taiwan), these groups differ geographically but share similar educational backgrounds. All participants were undergraduate students between the ages of 19 and 24, enrolled in their respective institutions at the time of the study. Therefore, the observed differences in eye-tracking metrics are unlikely to be caused by differences in educational background or academic standing, as both groups consisted of undergraduate students with comparable levels of education. Instead, the variations may more plausibly reflect subtle cultural or institutional influences. Future studies could further refine these comparisons by incorporating additional demographic and experiential controls to better isolate the influence of regional and institutional variables.

4.4. Implications for Robotic Perception and Design

One of the broader implications of this study lies in the development of perception modules for autonomous robotic systems. The spatiotemporal gaze patterns extracted from expert participants—such as fixation sequences, priority regions, and dwell-time heatmaps—can serve as priors for designing attention mechanisms in robots tasked with complex visual interpretation. For example, these human-derived visual strategies can be encoded into saliency-based scene exploration algorithms, enabling robots to prioritize relevant regions in a crime scene or hazardous environment more efficiently. This form of perceptual modeling bridges human cognitive behavior and robotic autonomy, providing a foundational step toward more intelligent and human-aligned robotic vision systems.
In addition to its relevance for forensic training and robot perception systems, our approach holds promise for advancing autonomous vehicle technologies. By capturing expert-like visual scanning behaviors, autonomous systems can be trained to better assess complex, high-risk environments, improving situational awareness and decision-making in real-world driving contexts.

4.5. Future Work

On the one hand, we could consider whether other clustering methods could better capture the regional separation. On the other hand, to further improve prediction accuracy, one could enhance random forest by incorporating additional high-order interaction terms or explore sequence-based deep-learning models that capture temporal eye-movement dynamics at a finer granularity.

Author Contributions

Methodology, W.-C.Y. and J.J.; data acquisition, C.-H.S. and W.-C.Y.; crime scene simulation experiment, C.-H.S.; writing—original draft preparation, J.J.; bibliography, J.J. and S.P.E.; writing—review and editing, J.J., S.P.E., W.-C.Y. and C.-H.C.; supervision, W.-C.Y. and C.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of the Interior (Project No. 114-0805-02-28-01) and the National Science and Technology Council (MOST 109-2410-H-015-012-MY2), Republic of China (Taiwan).

Institutional Review Board Statement

The study protocol was approved by the Institutional Review Board of Antai Medical Care Corporation, Antai-Tian-Sheng Memorial Hospital (IRB No. 23-095-B), on 15 December 2023.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overview of the methodology pipeline.
Figure 2. An example of a simulated crime scene.
Figure 3. Feature importances for regional differences.
Figure 4. Residual plots for five machine learning models.
Figure 5. Linear regression feature-coefficient chart.
Figure 6. Random forest feature-importance chart.
Figure 7. Feature correlation heatmap.
Table 1. Confusion matrix of regional differences.

| Regions | Predicted 1 | Predicted 2 |
| True 1 (Singapore) | 14 | 9 |
| True 2 (Taiwan) | 1 | 22 |
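As a quick arithmetic check on Table 1, the overall classification accuracy follows directly from the matrix:

```python
correct = 14 + 22           # correctly assigned participants (diagonal of Table 1)
total = 14 + 9 + 1 + 22     # all 46 participants
print(correct / total)      # ≈ 0.7826, i.e., 78.26% clustering accuracy
```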
Table 2. Evaluation results for all models.

| ML Model | MSE | R² |
| Linear Regression | 0.0066 | 0.1757 |
| Polynomial Regression | 0.0161 | −1.0068 |
| Random Forest | 0.0024 | 0.7034 |
| Decision Tree | 0.0029 | 0.6379 |
| SVR | 0.0057 | 0.2909 |
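The comparison in Table 2 can be outlined as follows (a sketch with scikit-learn on placeholder data; the actual features, train/test split, and hyperparameters used in the study are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Polynomial Regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "SVR": SVR(),
}

rng = np.random.default_rng(0)
X = rng.normal(size=(46, 6))    # placeholder eye-tracking features
y = rng.normal(size=46)         # placeholder proficiency scores
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MSE={mean_squared_error(y_te, pred):.4f}, R2={r2_score(y_te, pred):.4f}")
```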
Table 3. SVM classification results of eye-tracking proficiency.

| Classes | Average Accuracy | C | γ | Kernel | 2, 3, and 4 Factors |
| 3 | 0.7250 | 10 | scale | RBF | 4 |
| 3 w/ squared terms | 0.7250 | 10 | scale | RBF | 4 |
| 4 | 0.5550 | 10 | auto | RBF | 4 |
| 4 w/ squared terms | 0.5900 | 1 | auto | RBF | 4 |
Table 4. Detailed 10-fold cross-validation accuracy for best SVM combinations.

Detailed 10-fold scores for best combination (3-class):
| Fold # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
| Accuracy | 0.60 | 0.80 | 0.60 | 0.40 | 0.60 | 1.00 | 0.50 | 0.75 | 1.00 | 1.00 | 0.7250 |

Detailed 10-fold scores for best combination (4-class):
| Fold # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
| Accuracy | 0.40 | 0.60 | 0.60 | 0.20 | 0.60 | 1.00 | 0.75 | 0.75 | 0.75 | 0.25 | 0.5900 |
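The cross-validation protocol summarized in Tables 3 and 4 can be sketched as follows (again with scikit-learn and placeholder data; the C, gamma, and kernel values mirror the best 3-class setting in Table 3, while the features and labels shown are stand-ins):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(46, 6))       # placeholder eye-tracking features
y = rng.integers(0, 3, size=46)    # placeholder 3-class proficiency labels

# Best 3-class setting from Table 3: RBF kernel, C=10, gamma='scale'
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
fold_scores = cross_val_score(clf, X, y, cv=10)   # one accuracy value per fold
print(fold_scores, fold_scores.mean())
```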