1. Introduction
Eye-tracking technology captures the movement of the human eye and calculates gaze direction, facilitating contactless human–computer interaction. Compared to traditional input devices, it offers flexibility, speed, and accuracy, enabling direct two-way information transmission between humans and machines [1,2]. Emerging fields, such as telemedicine, intelligent industries, and entertainment, require timely and efficient human–computer interaction, driving the need for versatile and precise solutions with significant application potential. Eye-tracking technology has become a foundational enabler for a wide spectrum of human–computer interaction applications, spanning immersive augmented reality and virtual reality headsets, portable laptop-based interaction systems, driver monitoring platforms, and assistive technologies for individuals with motor impairments. Existing gaze tracking technologies fall primarily into two types: screen-based gaze tracking and head-mounted gaze tracking. Screen-based gaze tracking typically employs visible light cameras to capture images of the eyes and project gaze direction onto a screen [3,4]. This method is limited to two-dimensional tracking and is restricted to fixed-position screen interactions, resulting in low accuracy and a lack of flexibility. In contrast, head-mounted gaze tracking usually utilizes infrared imaging technology for three-dimensional tracking [5,6,7]. This technology captures eye images and calculates gaze direction, offering higher accuracy and allowing for more lifelike interaction scenarios. This paper presents a self-developed, lightweight geometric framework for head-mounted binocular 3D gaze tracking, designed to deliver accurate 3D gaze vectors using only low-cost hardware without deep learning. The key contributions of this study include a robust geometric formulation that achieves competitive angular accuracy while significantly improving 3D position precision, and comprehensive evaluations comparing the proposed method with both state-of-the-art academic approaches and representative commercial systems. The remainder of this paper is structured as follows:
Section 2 reviews related work on existing gaze tracking methods. Section 3 details the system setup and the proposed geometric framework. Section 4 describes the experimental design, including the participant pool, evaluation conditions, and performance metrics, and presents the experimental results by comparing the proposed method with both academic baselines and commercial eye-tracking systems. Finally, Section 5 concludes the paper and outlines directions for future work.
2. Related Work
Accurate estimation of gaze direction relies fundamentally on precise extraction of the pupil contour and exact determination of the pupil center position. Kagemoto et al. proposed high-speed pupil tracking using an event camera based on the bright and dark pupil effect. Two illumination sources generated events in the pupil area, and the pupil center was determined in real time at over 2000 Hz without requiring a complete image from the events [8]. Lu et al. proposed an industry-oriented ellipse detector based on arc-support line segments, which achieved both high detection accuracy and high efficiency. In extensive experiments on three public datasets, their method achieved the best F-measure scores compared to the state-of-the-art methods [9]. Notably, models based on convolutional neural networks have been widely adopted in this field, as they can learn patterns of human eye movement from extensive datasets. One such work proposed a novel 3D pupil localization method with deep learning-based corneal refraction correction, which outperformed state-of-the-art works by reducing the 3D pupil localization error by 47.5% and the gaze estimation error by 18.7% [10]. In addition, Xiong et al. proposed a lightweight pupil localization algorithm that utilized a convolutional neural network (CNN) with additional training samples. The experimental results demonstrated the algorithm’s effectiveness in identifying the pupil position within the training set, with the accuracy of pupil position in the test set reaching 97.78% [11]. These approaches significantly improved the accuracy and robustness of gaze tracking. Furthermore, low-resolution facial images remain a major bottleneck for practical gaze tracking. To address this, Yan et al. proposed FSKT-GE, a lightweight knowledge transfer framework. It aligns intermediate features via cosine similarity, transferring high-resolution knowledge to low-resolution networks. Evaluated on Gaze360 and RT-Gene with 2×–8× downsampling, it achieves MAEs of 10.97–13.61° and 6.73–7.75°, outperforming mainstream methods [12]. Additionally, lightweight deep learning models have been optimized for gaze tracking on edge devices. Habib et al. optimized MobileNetV3 via structured pruning and post-training quantization, achieving an ultra-lightweight model for multimodal eye-gaze and emotion recognition. Their method reduces inference time to 3 ms while maintaining high accuracy, offering an efficient solution for real-time gaze estimation on resource-constrained devices [13]. However, the methods mentioned above require substantial training datasets, involve protracted development periods, and incur high costs.
As consumer demand for seamless, intuitive, and always-on interaction grows, there is an urgent need for eye-tracking solutions that deliver high 3D gaze estimation accuracy while remaining compatible with low-cost, resource-constrained, and battery-powered devices. However, existing state-of-the-art systems fail to meet this dual requirement: commercial solutions, such as Apple Vision Pro with M5 [14], VIVE Focus Vision [15], and Windows Studio [16], achieve exceptional angular accuracy but rely on specialized, high-cost hardware, closed-source proprietary algorithms, and significant computational overhead, making them prohibitively expensive and power-hungry for mass-market portable electronics.
4. Experimental Results and Analysis
A total of 30 healthy adult Asian subjects, consisting of 15 male and 15 female participants, were recruited and involved in the experimental tests. They participated in laboratory-based wear trials to systematically evaluate the robustness of the proposed algorithm. Participants met the inclusion criteria of having normal or corrected-to-normal vision, no ocular diseases, no severe head tremor, and no history of eye surgery, while those with strabismus, neurological disorders, or inability to maintain stable head posture were excluded. Among them, 12 wore glasses, 5 wore contact lenses, and 13 had uncorrected vision; eye dominance was balanced (16 right-eye dominant, 14 left-eye dominant), with an average interpupillary distance of 63.5 ± 2.1 mm.
Experiments were conducted at seven fixed depth planes evenly spaced within the range of 0.3 m to 1 m. To ensure a rigorous and unbiased evaluation, calibration and test points were strictly separated and never reused across phases. As detailed in Section 3.6, a fixed set of 19 calibration points was used to estimate the model parameters, while a disjoint set of 15 test points was used for accuracy evaluation at each depth plane. No test point was included in the calibration set, eliminating any risk of inflated accuracy due to data leakage. At each plane, participants completed 15 test targets, with five repetitions per test target. During each test, participants were instructed to fixate on a target point displayed on the screen. All tests were performed in a controlled indoor environment with uniform soft lighting, and participants rested their chin on a soft support to maintain natural head posture and minimize movement. The head-mounted device integrated an industrial-grade 950 nm infrared camera as the input sensor, paired with an IR LED fill light of identical central wavelength, capturing video at 1920 × 1080 resolution and 60 FPS. The software environment included Python 3.12 and OpenCV 4.8. We conducted the comparative experiment on a mainstream laptop hardware platform equipped with an Intel Core Ultra 9 288V processor (Lunar Lake architecture, 3 nm manufacturing process) with eight cores and eight threads, a base frequency of 2.20 GHz, a single-core turbo frequency of up to 5.10 GHz, a TDP of 30 W, and 32 GB of LPDDR5X RAM. A commercial high-precision 3D eye tracker (RED 500, SensoMotoric Instruments GmbH, Teltow, Germany) was adopted as the ground truth reference, which can provide accurate 3D point of regard coordinates with a mean error of less than 0.5 mm [22]. Regarding recalibration frequency, a one-time calibration procedure was performed at the beginning of each participant’s session. Considering that each participant spent only a short time on the experiment, the calibrated parameters were used consistently for all subsequent depth planes and test trials. This design choice was validated in preliminary experiments, which confirmed stable performance across the 0.3–1.0 m depth range without recalibration. We evaluated the 2D angular error and the absolute 3D Euclidean position error across the seven depth planes to fully characterize the 3D behavior of the proposed method; the spatial coordinates of the points to be observed are shown in Figure 9.
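The bookkeeping of this protocol can be illustrated with a minimal Python sketch. The counts (seven planes, 19 calibration points, 15 test points, five repetitions) are taken from the description above; the point identifiers and the helper names `depth_planes` and `assert_disjoint` are hypothetical and serve only to show how even spacing and calibration/test separation might be checked.

```python
# Sketch of the evaluation protocol bookkeeping.
# Counts come from the text; identifiers and helper names are hypothetical.
N_PLANES = 7
DEPTH_MIN_M, DEPTH_MAX_M = 0.3, 1.0
N_CALIBRATION_POINTS = 19
N_TEST_POINTS = 15
N_REPETITIONS = 5


def depth_planes(n=N_PLANES, near=DEPTH_MIN_M, far=DEPTH_MAX_M):
    """Return n evenly spaced depths in metres from near to far, inclusive."""
    step = (far - near) / (n - 1)
    return [round(near + i * step, 4) for i in range(n)]


def assert_disjoint(calibration_ids, test_ids):
    """Raise ValueError if any point is used for both calibration and testing."""
    overlap = set(calibration_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"points reused across phases: {sorted(overlap)}")


# Hypothetical point identifiers; the real target layout is defined in the paper.
calibration_ids = [f"cal_{i:02d}" for i in range(N_CALIBRATION_POINTS)]
test_ids = [f"test_{i:02d}" for i in range(N_TEST_POINTS)]

assert_disjoint(calibration_ids, test_ids)
planes = depth_planes()
fixations_per_participant = len(planes) * N_TEST_POINTS * N_REPETITIONS
print(planes)
print(fixations_per_participant)
```

Under these counts, each participant contributes 7 × 15 × 5 test fixations in total.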
During human eye observation, several highly disruptive conditions are unavoidable. Specifically, optical imaging effects can induce non-linear pupil deformation, such as stretching and squeezing, when the eye views objects from extreme angles. Additionally, partial occlusion of the pupil may occur, which compromises pupil integrity and presents a substantial challenge to accurate feature extraction and localization.
Figure 10 illustrates representative pupil center detection results for these scenarios, where the red dot marks the detected pupil center, and the three yellow dots and the dotted line represent the medial and lateral canthi and the midpoint of the line connecting them. Even at extreme viewing angles, where the pupil deviates from its ideal circular shape to an irregular elliptical form, the proposed algorithm reliably and accurately segments and identifies the elliptical pupil region. In addition, for test cases where pupil occlusion does not exceed 40%, the algorithm effectively mitigates interference through contextual feature completion and redundant feature verification, consistently producing stable localization results that match the true pupil position without failure due to partial information loss.
The experimental results for gaze movement within a range of ±10° vertically and ±12° horizontally, expressed in a spherical coordinate system with the eyeball as the origin, are presented in Figure 11.
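As an illustration of how an angular error can be obtained in such a coordinate system, the following sketch converts a pitch angle θ and a horizontal angle φ into a unit gaze vector and measures the angle between two such vectors. The axis convention (z pointing forward, y upward) and the function names are assumptions made for this example and are not taken from the paper.

```python
import math


def gaze_vector(theta_deg, phi_deg):
    """Unit gaze vector for pitch theta and horizontal angle phi (degrees).

    Assumed convention: z points forward from the eyeball origin, y points up.
    """
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.cos(t) * math.sin(p), math.sin(t), math.cos(t) * math.cos(p))


def angular_error_deg(a, b):
    """Angle in degrees between two 3D vectors, using atan2 for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    cross = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    return math.degrees(math.atan2(math.hypot(*cross), dot))


truth = gaze_vector(0.0, 12.0)     # ground truth direction
estimate = gaze_vector(1.0, 11.0)  # hypothetical estimate
print(angular_error_deg(truth, estimate))
```

The `atan2` form is used instead of `acos` of the dot product because it stays well conditioned for the small angles typical of gaze errors.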
As shown in Figure 11, we first evaluated the angular estimation performance across all seven depth planes. In Figure 11a–g, the black plus signs represent the ground truth gaze angles, the red dots represent the estimated gaze angles from the proposed method, and the black dashed lines connect the corresponding points to visualize the estimation error. The estimated points are highly consistent with the ground truth points across all depth planes. For the 3D positioning performance, we further evaluated the estimation results under three different pitch conditions: upward gaze (θ = 10°), neutral gaze (θ = 0°), and downward gaze (θ = −10°), as shown in Figure 11h, Figure 11i, and Figure 11j, respectively. In these 3D visualization plots, the horizontal axis denotes the horizontal gaze angle φ, and the oblique axis represents the target depth. The red plus signs indicate the ground truth 3D points, the green dots represent the estimated 3D points, and the error bars overlaid on the green dots denote the 95% confidence intervals of the depth estimation, derived from multiple fixations on the same target point. The confidence interval of the depth estimation remains small across all depth ranges, which indicates that the proposed method can provide reliable depth estimation with stable uncertainty, even for far targets. We adopted the RMS error of three-dimensional coordinates to quantitatively assess the 3D positioning accuracy for each target point. The RMS error is calculated as
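$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(z_i-\hat{z}_i)^2\right]}$$
where $(x_i, y_i, z_i)$ and $(\hat{x}_i, \hat{y}_i, \hat{z}_i)$ denote the ground truth and estimated 3D coordinates of each individual target point, respectively, and $N$ is the total number of test samples.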
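The RMS error described above can be computed as in the following minimal sketch, which assumes the usual definition: the squared Euclidean distance between each ground truth point and its estimate, averaged over the N samples, followed by a square root. The function name `rms_error_3d` and the sample coordinates are hypothetical.

```python
import math


def rms_error_3d(truth, estimate):
    """RMS of the 3D Euclidean distance between paired points.

    truth, estimate: equal-length sequences of (x, y, z) tuples in the same unit.
    """
    if len(truth) != len(estimate) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        (xt - xe) ** 2 + (yt - ye) ** 2 + (zt - ze) ** 2
        for (xt, yt, zt), (xe, ye, ze) in zip(truth, estimate)
    )
    return math.sqrt(total / len(truth))


# Hypothetical repeated fixations on one target, coordinates in millimetres.
truth = [(0.0, 0.0, 300.0)] * 3
estimate = [(1.0, -2.0, 302.0), (0.5, 1.0, 298.5), (-1.0, 0.0, 301.0)]
print(rms_error_3d(truth, estimate))
```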
Figure 12 shows the distribution of the RMS of the gaze point estimation error varying with depth under different perspectives.
As illustrated in Figure 12, the RMS error of depth estimation increases monotonically with target distance for all gaze directions, following a sub-quadratic growth trend that reflects the inherent uncertainty of stereo triangulation. At 0.3 m, the error remains below 3 mm for all configurations; at 1.0 m, it rises to approximately 18 mm under the most challenging ±12° horizontal condition. The most significant performance degradation occurs with increasing horizontal gaze angles: errors at ±6° and ±12° are substantially higher than at 0°, resulting in an overall RMS angular error of 2.82° for the proposed method. In contrast, vertical gaze angles (±10°) cause only a modest increase in error, with the upper and lower curves nearly indistinguishable. Overall, the proposed system maintains RMS depth accuracy within 20 mm across the intended 0.3–1.0 m working range, even for large off-axis angles, ensuring reliable performance for typical laptop and near-eye interaction scenarios. As the target distance increases from 0.3 m to 0.9 m, the 3D position error increases from 8.23 mm to 18.92 mm, with a slight increase in angular error. This trend follows the inherent principle of binocular stereo vision. We also analyzed the correlation between the true target distance and the estimated vergence angle, that is, the angle between the left and right eye gaze vectors, which is the core physiological cue for human depth perception. The correlation coefficient between the estimated vergence angle and the true target distance reaches 0.97 overall and remains above 0.95 across all seven depth planes. This indicates that the estimated vergence angle changes synchronously and stably with the target distance, which supports the claim that the proposed method captures the 3D characteristics of gaze.
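The link between vergence angle and depth can be illustrated with an idealized sketch. It assumes symmetric fixation on a target straight ahead of the midpoint between the eyes, and uses the mean interpupillary distance of 63.5 mm reported for the participants; the function names are hypothetical, and the simplified model is not the paper's geometric framework. In this model, a fixed error in the vergence angle produces a depth error that grows with target distance.

```python
import math

IPD_MM = 63.5  # mean interpupillary distance reported for the participants


def vergence_angle_deg(depth_mm, ipd_mm=IPD_MM):
    """Vergence angle for a target straight ahead at depth_mm (symmetric model)."""
    return math.degrees(2.0 * math.atan(ipd_mm / (2.0 * depth_mm)))


def depth_from_vergence_mm(angle_deg, ipd_mm=IPD_MM):
    """Inverse of vergence_angle_deg: fixation depth implied by a vergence angle."""
    return ipd_mm / (2.0 * math.tan(math.radians(angle_deg) / 2.0))


def depth_shift_mm(depth_mm, angle_error_deg, ipd_mm=IPD_MM):
    """Depth overestimate when vergence is underestimated by angle_error_deg."""
    angle = vergence_angle_deg(depth_mm, ipd_mm)
    return depth_from_vergence_mm(angle - angle_error_deg, ipd_mm) - depth_mm


# Same hypothetical 0.1 degree vergence error at the near and far ends of the range.
for depth in (300.0, 1000.0):
    print(depth, vergence_angle_deg(depth), depth_shift_mm(depth, 0.1))
```

Because the vergence angle shrinks as the target moves away, the same angular error corresponds to a larger change in depth at 1.0 m than at 0.3 m in this model.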
We further compared the proposed method with existing state-of-the-art methods, including an academic solution based on Pupil Core eye-tracking goggles [23] and an academic method for generating accurate 3D gaze vectors using synchronized eye tracking and motion capture [24], to clarify the contribution of the proposed method. The results are summarized in Table 2.
Table 2 compares the performance of our proposed method with two state-of-the-art gaze tracking approaches. The proposed lightweight geometric method achieves an angular error (1.1–2.82°) comparable to that of the other two methods, while significantly outperforming them in terms of 3D position error. Specifically, our approach provides a 3D position RMS error of less than 13.24 mm, which is 34% lower than the eye-motion capture-based method (<20 mm) and 56% lower than the noise estimation-based method (<30 mm). This balance of high angular precision and low 3D positioning error validates the effectiveness of our self-developed geometric framework for practical eye-tracking applications. We further include three representative consumer-grade eye-tracking systems, namely Apple Vision Pro with M5, VIVE Focus Vision, and Windows Studio, as detailed in Table 3.
It should be noted that these performance values are taken from official specifications and published materials, and were not evaluated under the same controlled experimental conditions as our method. While these commercial platforms may achieve comparable angular accuracy in their intended use cases, they typically do not report detailed 3D position error metrics. In contrast, our approach delivers both competitive angular error and explicitly validated 3D positioning precision, demonstrating its suitability for high-precision 3D gaze tracking applications. Compared with high-end commercial systems, the proposed method achieves comparable accuracy at much lower hardware cost, without requiring dedicated and costly high-precision sensors, making it suitable for entry-level devices.
5. Conclusions
This study presents a lightweight, purely geometric framework for high-precision 3D gaze tracking based on infrared image processing, which achieves robust and real-time performance on low-cost, resource-constrained platforms. The proposed method avoids heavy deep learning networks and relies on geometric modeling, binocular stereo constraints, and infrared optical design to estimate 3D gaze accurately under variable viewing angles and depth conditions, effectively suppressing performance degradation caused by large viewing angles, pupil occlusion, and unreliable depth extraction in low-resolution infrared images. Experimental results show that the framework achieves favorable precision, strong robustness against partial occlusion, extremely low computational overhead, and high real-time performance. Compared with appearance-based deep learning methods, conventional model-based schemes, and commercial eye-tracking devices, the proposed framework exhibits clear advantages in computational expense, ease of deployment, 3D estimation accuracy, and cost, making it suitable for head-mounted devices, wearable systems, automotive driver monitoring, and portable human–computer interaction applications. Future work will focus on improving robustness under extreme head rotations, long-distance measurement, and eyeglass interference, as well as optimizing the geometric model to further improve gaze estimation accuracy and extending the method to more compact binocular infrared sensing architectures.