1. Introduction
Amblyopia is a common developmental visual disorder characterized by reduced visual function that cannot be fully explained by structural abnormalities of the eye and is typically associated with abnormal visual experience during early childhood (e.g., anisometropia, strabismus, or both) [1,2,3]. Because neural plasticity is highest in early life, timely detection and intervention are critical: earlier treatment is associated with better outcomes, whereas delayed diagnosis increases the risk of persistent visual impairment and reduced binocular function into adulthood [4,5]. Despite this, population-level screening remains challenging. Conventional screening often relies on monocular visual acuity testing, stereoacuity, and/or instrument-based photoscreening, each with practical limitations in young children, variable sensitivity across amblyopia etiologies, and dependence on attention, cooperation, and trained personnel [6,7]. In busy clinics or community settings, there is a pressing need for child-friendly, time-efficient, and objective screening tools that can be deployed with minimal burden while still capturing functional markers relevant to amblyopia [8].
Beyond reduced acuity, amblyopia is associated with characteristic alterations in oculomotor behavior [9]. A growing body of work suggests that fixation stability, microscale drift, saccadic control, and smooth-pursuit fidelity can be compromised, reflecting altered sensory encoding and visuomotor integration [10,11]. Importantly, these oculomotor signatures can persist even under binocular viewing because binocular fusion and the dominant (fellow) eye do not necessarily normalize the amblyopic eye’s sensory noise, interocular imbalance, or motor control adaptations acquired during development [12,13,14,15]. Such findings motivate oculomotor assessment as an attractive complement to acuity-based screening: eye movements provide a continuous, quantitative readout of the visual system during naturalistic viewing and can be elicited by short tasks that are engaging for children [16,17,18,19,20,21].
Eye tracking provides a practical measurement modality for such oculomotor biomarkers [22,23,24]. Early studies using laboratory-grade trackers under head restraint demonstrated that amblyopia can be associated with increased fixation instability, abnormal saccadic latencies, altered pursuit gain, and more frequent catch-up saccades [25,26,27,28,29]. However, translating these insights to screening faces several barriers. First, many laboratory protocols use specialized equipment, high sampling rates, and rigid head stabilization, which are not easily compatible with child screening in natural settings. Second, tasks that isolate a single oculomotor component may not be sufficiently robust in real-world deployment because children differ in compliance, fatigue, and attentional state [30,31,32,33]. Third, algorithmic pipelines can be sensitive to sampling rate and preprocessing choices; metrics such as microsaccade rate and fine-grained velocity profiles are difficult to estimate reliably at the low sampling rates typical of lightweight wearable devices [34,35,36]. Finally, reported model performance in small cohorts can be inflated by methodological pitfalls, particularly information leakage when trial-level samples from the same subject are split across training and testing, or when preprocessing and operating-point selection implicitly use held-out subjects. For screening applications, rigorous subject-wise evaluation and transparent reporting are therefore essential [37].
An additional practical question concerns how to leverage binocular eye tracking for unilateral amblyopia. In screening contexts, children typically view stimuli binocularly (both eyes open) under habitual correction. This condition is ecologically valid, but it raises a methodological choice: whether to model the amblyopic-eye stream, the fellow-eye stream, or a binocularly combined (“cyclopean”) gaze estimate. Cyclopean representations can improve robustness to transient tracking loss and reflect integrated visuomotor output during natural binocular viewing; however, they may also attenuate amblyopia signatures if the fellow eye dominates control [24]. Addressing this trade-off requires clear signal definition, principled missing-data handling, and sensitivity analyses that compare performance across gaze-stream choices.
Motivated by these considerations, we propose a child-oriented screening framework that combines a lightweight wearable binocular eye tracker, a brief task-evoked protocol, and an interpretable machine-learning model evaluated under a strict subject-wise design. The protocol comprises three short tasks that probe complementary oculomotor behaviors relevant to amblyopia: (1) visual enumeration/search to capture scanning and stability, (2) orientation-dependent grating viewing to probe gaze stabilization, and (3) two-dimensional smooth pursuit to quantify eye–target coordination and corrective saccades. Together, the tasks yield clinically interpretable features that summarize gaze stability, saccadic organization, and pursuit fidelity, improving robustness beyond any single task.
To obtain reproducible estimates, we use a nested, strictly subject-wise evaluation pipeline (see Figure 1). Standardization, imputation, model fitting, probability calibration, and operating-point selection are learned from training subjects only within each split. We use a Random Forest as the primary classifier and calibrate probabilities with isotonic regression. We report discrimination (AUC, average precision) and operating-point metrics (sensitivity, specificity, and accuracy) with uncertainty estimates, and translate sensitivity/specificity into expected PPV/NPV across plausible prevalence values.
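The translation from sensitivity/specificity to expected PPV/NPV follows directly from Bayes' rule. The sketch below is illustrative only: the function name, the operating point (0.90, 0.95), and the prevalence grid are placeholder assumptions, not values taken from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes' rule for a given prevalence in [0, 1]."""
    tp = sensitivity * prevalence                   # true-positive mass
    fp = (1.0 - specificity) * (1.0 - prevalence)   # false-positive mass
    fn = (1.0 - sensitivity) * prevalence           # false-negative mass
    tn = specificity * (1.0 - prevalence)           # true-negative mass
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point and prevalence grid (assumptions for illustration).
for prevalence in (0.01, 0.03, 0.05):
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

At a fixed operating point, PPV depends strongly on prevalence, which is why screening performance is better summarized across a range of plausible prevalence values than at a single one.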
We evaluate this framework in school-aged children with unilateral amblyopia and age-matched controls under best-corrected binocular viewing. Our contributions are threefold: (1) a practical wearable binocular eye-tracking protocol suitable for children; (2) an interpretable multi-task feature representation evaluated with leakage-resistant nested subject-wise validation and calibrated thresholding; and (3) screening-oriented analyses including baseline comparisons, task ablation/feature importance, robustness to preprocessing choices, and sensitivity to gaze-stream definition (cyclopean vs. eye-specific). These results support task-evoked wearable eye tracking as a quantitative screening approach for unilateral amblyopia.
1.1. Hardware Description
We employed a custom wearable binocular eye-tracking prototype optimized for school-aged children and natural head posture during short screening-style tasks. The device integrates three identical off-the-shelf USB camera modules in a lightweight head-mounted frame: two inward-facing cameras (one per eye) for infrared (IR) eye imaging and pupil localization, and one outward-facing world camera to capture the stimulus display for screen localization.
All three cameras used the same module and optics (12-megapixel sensor; maximum resolution 3840 × 3104 pixels; 3.95 mm lens; approximately 80° field of view). For eye imaging, the inward-facing cameras operated with integrated 940 nm IR illumination to provide stable pupil observation while reducing sensitivity to ambient visible-light variation.
The gaze signal for subsequent analyses was produced at 30 Hz. This sampling configuration preserves high eye-image resolution for reliable pupil-center localization while maintaining a compact and lightweight wearable design. All downstream analyses were designed to rely on oculomotor descriptors that are robust at 30 Hz (e.g., gross fixation dispersion, scan-path statistics, and pursuit fidelity), rather than fine-scale microsaccadic dynamics.
A Python 3.9-based host application performed synchronized acquisition and online processing. At each time step, pupil centers were detected independently in the left- and right-eye IR image streams and mapped to gaze locations in display coordinates using per-eye calibration functions. To maintain a stable definition of screen coordinates under natural head motion, ArUco fiducial markers were placed on and around the display and detected in the world-camera stream. Marker corners were used to estimate the screen plane and compute a planar transformation between world-camera pixels and the known screen coordinate system. Because the world camera was rigidly mounted to the eye cameras, updating the screen plane in the world-camera frame provides a consistent gaze-to-screen mapping under minor pose variations.
Visual stimuli were presented on a 24-inch liquid-crystal display (1920 × 1080 pixels; 60 Hz) at a viewing distance of 60 cm. The monitor was luminance-calibrated before testing (white field 120 cd/m²; gamma 2.2) to ensure repeatable photometric conditions across participants and sessions.
Calibration used a child-friendly five-point dynamic procedure. Five screen locations were displayed sequentially (800 ms per location). For each eye, we fitted a second-order two-dimensional polynomial mapping from pupil-center coordinates in eye-image space to gaze locations in screen coordinates using least-squares regression. The calibration sequence could be repeated once when necessary. Calibration was accepted when the mean screen-plane error did not exceed 0.8° of visual angle.
Rigid head restraint (e.g., chin rest) was not imposed. Participants were instructed to maintain an upright posture and to keep their head as still as possible. Calibration was repeated if noticeable drift occurred. Analyses were restricted to trials in which gaze remained within the calibrated region.
1.2. Participants and Clinical Assessment
Seventy school-aged participants were recruited from the ophthalmology outpatient clinic and affiliated community health centers of Beijing Tongren Hospital, including 35 children with clinically diagnosed unilateral amblyopia and 35 age-matched visually normal controls. Demographic and clinical characteristics are summarized in Table 1.
All participants underwent a comprehensive clinical assessment performed by a pediatric ophthalmologist, including best-corrected visual acuity (BCVA), refraction, ocular alignment evaluation, and stereopsis testing, following routine clinical protocols at our center. BCVA was recorded for each eye under habitual optical correction and reported in logMAR units. Amblyopia etiology was classified as anisometropic, strabismic, or mixed based on clinical findings (Table 1).
Amblyopia-group inclusion criteria were: (i) unilateral amblyopia diagnosed clinically; (ii) a BCVA in the amblyopic eye between 0.2 and 0.5 logMAR (approximately 0.3–0.6 decimal) with an interocular acuity difference of at least 0.2 logMAR; and (iii) anisometropia, strabismus, or a mixed mechanism identified as the underlying cause. Exclusion criteria included a history of intraocular surgery, ocular pathology other than amblyopia, or neurological disease.
Control-group inclusion criteria were a BCVA of at least 0.8 (decimal; ≤0.1 logMAR) in each eye with an interocular difference of at most 0.1 logMAR, normal binocular alignment and stereopsis, and emmetropia or minimal refractive error.
All eye-tracking tasks were performed binocularly with both eyes open while children wore their habitual optical correction (spectacles or contact lenses, when prescribed) that achieved their enrollment BCVA, so that the recordings reflected best-corrected function rather than uncorrected blur. Children with known cognitive impairment or a history of neurological or psychiatric disorders were excluded. The study protocol was approved by the Institutional Ethics Committee of Beijing Tongren Hospital, and written informed consent was obtained from parents/guardians in accordance with the Declaration of Helsinki.
1.3. Experimental Paradigm
To probe oculomotor impairments associated with unilateral amblyopia under binocular viewing conditions, we designed a three-stage, task-driven protocol comprising (1) an animal icon enumeration task, (2) an orientation-dependent grating viewing task, and (3) a dynamic sinusoidal pursuit task. All tasks were presented on the calibrated LCD display in a quiet, dimly lit room at a viewing distance of 60 cm. Eye movements were recorded binocularly and continuously throughout the experiment.
Each task was organized into discrete trials defined by task-specific onset and offset events. Practice trials were provided before each task to ensure that children understood the instructions and response requirements. Short breaks (~30 s) were provided between tasks, and additional brief pauses were allowed as needed to maintain attention and comfort. The task structure and timing are shown in Table 2.
1.3.1. Task 1: Animal Icon Enumeration
In Task 1 (see Figure 2a), participants viewed a child-friendly array containing multiple cartoon animal categories arranged with uniform spacing on a neutral background. At trial onset, an on-screen prompt specified the target category (e.g., “panda”). Participants freely explored the array to count the number of target animals and reported the total via keypress/button response. The array remained on screen until the response or for a maximum of 60 s.
This task was designed to quantify fixation stability, search efficiency, and saccadic strategy (including regressions and refixations) under naturalistic exploration. Because the array contains fine spatial features (edges and textures), the task imposes demands on spatial vision and attention that may elicit compensatory search patterns in children with amblyopia even during best-corrected binocular viewing [38,39].
1.3.2. Task 2: Orientation-Dependent Grating Viewing
Task 2 (see Figure 2b) assessed orientation-dependent pattern processing using centrally presented, static Gabor patches (Gaussian-windowed sinusoidal gratings). Each patch subtended 3° of visual angle and had a spatial frequency of 6 cycles/degree. Stimuli were presented at 12 orientations (0°, 15°, 30°, …, 165°) and at three suprathreshold Michelson contrast levels (0.2, 0.4, and 0.8), in pseudorandom order.
In each trial, the Gabor stimulus was displayed for 2 s against a mean-luminance background, followed by a 1 s blank interval. The carrier phase was randomized at each presentation to discourage reliance on local luminance cues. To ensure engagement, participants provided a coarse orientation report using four arrow keys mapped to four orientation bins (horizontal, vertical, left-oblique, and right-oblique). Behavioral responses were recorded to confirm task compliance but were not used as predictors in the present analysis.
The formal Task 2 block comprised 72 trials (12 orientations × 3 contrasts × 2 repetitions), presented in pseudorandom order. The primary analyses focus on orientation-dependent oculomotor behavior (e.g., fixation stability and drift) during stimulus viewing rather than psychophysical threshold estimation [40].
1.3.3. Task 3: Smooth Pursuit of a Sinusoidal Trajectory
In Task 3 (see Figure 2c), participants tracked a circular target (diameter 0.5°) moving along a two-dimensional sinusoidal trajectory for 20 s. The horizontal component had a peak-to-peak amplitude of approximately 30° (±15° from screen center), and the vertical component had a peak-to-peak amplitude of approximately 15° (±7.5° from center), yielding a maximum tangential velocity of approximately 10°/s within the central visual field. Random phase offsets were used across trials to reduce purely predictive tracking. A central fixation cross was presented for 3 s before each trial.
This task evaluated pursuit fidelity and eye–target coordination under binocular viewing. We expected larger tracking error, reduced pursuit gain, and more frequent corrective (catch-up) saccades in children with amblyopia due to reduced fidelity of motion processing and oculomotor control when one eye provides degraded input.
1.4. Task Sequence and Balance
The three tasks were administered in a fixed order (Task 1 → Task 2 → Task 3) to standardize the screening procedure and reduce instruction switching for children. The sequence progressed from a self-paced visual search task to brief discrete fixation trials and then to continuous smooth pursuit, allowing participants to become familiar with the display, response mapping, and head-mounted tracker before the continuous tracking block. Short inter-task breaks (~30 s) were provided, and additional breaks were allowed as needed.
To address potential fatigue or learning effects associated with the fixed order, trial order and timing were logged. Data-quality indicators were monitored throughout the session, including the proportion of invalid samples per trial (off-screen samples and samples removed due to blinks or tracking loss) and trial exclusion rates under predefined quality criteria. Order effects were evaluated by comparing early versus late trials within each task. These analyses are described in Section 1.6 and reported in the results.
All tasks were implemented in PsychoPy (v2023.2) and presented at a resolution of 1920 × 1080 pixels with a 60 Hz refresh rate. Stimulus timing was validated prior to data collection using a photodiode to confirm frame-locked onsets and accurate stimulus durations.
1.5. Signal Processing and Algorithm Design
The eye-tracking dataset provides time-resolved oculomotor information that can be summarized into clinically interpretable descriptors of gaze stability, saccadic strategy, and pursuit fidelity. We designed an analysis pipeline that transforms raw binocular gaze streams into fixed-length trial-level feature vectors and integrates these features into supervised classifiers to distinguish children with unilateral amblyopia from visually normal controls.
To ensure that reported performance reflects generalization to unseen subjects, all preprocessing transformations, model fitting, probability calibration, and operating-point selection were performed under a strictly subject-wise evaluation scheme. The following subsections describe gaze-signal definition and preprocessing, task-specific feature extraction, and the model development and evaluation protocol.
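As a rough illustration of what a strictly subject-wise scheme implies for data splitting, the sketch below generates leave-one-subject-out outer folds with inner folds built only from the remaining subjects (the nested protocol is detailed in Section 1.5.3). The function name, the number of inner folds, and the seeding are assumptions made for this example; the study's actual implementation is not shown in the text.

```python
import random


def nested_subject_splits(subject_ids, n_inner_folds=5, seed=0):
    """Yield (test_subject, inner_folds) for leave-one-subject-out testing.

    inner_folds is a list of (inner_train_ids, inner_val_ids) drawn only from
    the outer-training subjects, so the held-out subject never contributes to
    tuning, calibration, or threshold selection.
    """
    subjects = sorted(set(subject_ids))
    for test_subject in subjects:
        outer_train = [s for s in subjects if s != test_subject]
        random.Random(seed).shuffle(outer_train)
        inner_folds = []
        for k in range(n_inner_folds):
            val = outer_train[k::n_inner_folds]           # every k-th subject
            train = [s for s in outer_train if s not in val]
            inner_folds.append((train, val))
        yield test_subject, inner_folds


# All trials of a subject follow that subject's ID into exactly one partition.
for test_subject, inner_folds in nested_subject_splits(range(6), n_inner_folds=2):
    print(test_subject, inner_folds[0])
```

Splitting on subject identifiers rather than on trials is what prevents trials from the same child appearing on both sides of a split.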
1.5.1. Gaze Signal Definition and Preprocessing
The wearable tracker outputs two-dimensional gaze positions for the left and right eyes in screen coordinates at 30 Hz. Gaze coordinates were converted from pixels to degrees of visual angle using the known screen geometry and viewing distance so that downstream features were computed in angular units and were comparable across participants.
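The pixel-to-degree conversion can be sketched as follows, using the display geometry stated above (24-inch, 1920 × 1080 pixels, 60 cm viewing distance). The sketch assumes square pixels, an eye positioned on the normal through the screen center, and independent per-axis conversion; the function and constant names are hypothetical.

```python
import math

SCREEN_W_PX, SCREEN_H_PX = 1920, 1080
DIAGONAL_CM = 24 * 2.54            # 24-inch diagonal
VIEW_DIST_CM = 60.0

# Assumption: square pixels, so pixel pitch follows from the diagonal length.
CM_PER_PX = DIAGONAL_CM / math.hypot(SCREEN_W_PX, SCREEN_H_PX)


def px_to_deg(x_px, y_px):
    """Convert screen pixels to degrees of visual angle relative to screen center.

    The sign of the vertical angle follows the pixel convention (y grows downward).
    """
    dx_cm = (x_px - SCREEN_W_PX / 2) * CM_PER_PX
    dy_cm = (y_px - SCREEN_H_PX / 2) * CM_PER_PX
    return (math.degrees(math.atan2(dx_cm, VIEW_DIST_CM)),
            math.degrees(math.atan2(dy_cm, VIEW_DIST_CM)))


print(px_to_deg(960, 540))    # screen center
print(px_to_deg(1920, 540))   # right edge of the display
```

Using the arctangent rather than a fixed pixels-per-degree factor keeps the conversion accurate toward the screen edges, where the small-angle approximation degrades.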
Primary analyses used a cyclopean gaze signal to reflect binocular viewing during screening. When both eyes provided valid samples at a given timestamp, cyclopean gaze was computed as the sample-wise mean of the two eye positions. When only one eye sample was valid, that eye’s gaze sample was used. When neither eye was valid, the sample was treated as missing. In secondary analyses, features were recomputed using the amblyopic-eye stream and fellow-eye stream separately (based on clinical labeling) to assess dependence on the gaze definition.
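The cyclopean rule described above can be written compactly. This is a minimal sketch that assumes each eye's sample is an (x, y) tuple in degrees, with `None` marking an invalid sample; the function names are illustrative.

```python
def cyclopean_sample(left, right):
    """Combine one left/right gaze sample into a cyclopean sample.

    Both valid -> sample-wise mean; one valid -> that eye; none valid -> None.
    """
    if left is not None and right is not None:
        return ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    if left is not None:
        return left
    if right is not None:
        return right
    return None


def cyclopean_stream(left_stream, right_stream):
    """Apply the rule sample by sample to two time-aligned streams."""
    return [cyclopean_sample(l, r) for l, r in zip(left_stream, right_stream)]


left = [(1.0, 0.0), None, (2.0, 2.0), None]
right = [(3.0, 2.0), (0.5, 0.5), None, None]
print(cyclopean_stream(left, right))
```

One consequence of this rule is that the cyclopean signal switches between a binocular mean and a monocular estimate whenever one eye drops out, which is one reason the eye-specific secondary analyses are informative.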
Invalid samples were identified using device validity indicators (e.g., pupil-loss flags) and a kinematic plausibility rule. Samples were marked invalid if associated with tracking loss, if they fell outside the display region, or if the instantaneous angular velocity exceeded 1000°/s (indicative of transient tracking glitches). To reduce blink-edge artifacts, samples within ±1 frame (±33 ms) of invalid segments were also excluded. The cleaned gaze stream was segmented into trials according to task-defined boundaries.
Missing-data handling was designed to preserve robustness at 30 Hz without introducing artificial smoothness. Gaps of up to 200 ms (≤6 consecutive missing samples) were bridged by linear interpolation; longer gaps were retained as missing. Trials were excluded if more than 30% of samples were invalid or missing after cleaning. These criteria ensure that dispersion- and event-related measures are computed from sufficiently continuous data.
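A minimal sketch of the gap-handling rules follows, operating on one coordinate channel with `None` marking missing samples. The text does not state whether the 30% criterion is evaluated before or after gap filling, so `keep_trial` is written generically and can be applied at either stage; leaving leading and trailing gaps unfilled is likewise an assumption of this sketch.

```python
MAX_GAP_SAMPLES = 6        # 200 ms at 30 Hz
MAX_MISSING_FRACTION = 0.30


def fill_short_gaps(values, max_gap=MAX_GAP_SAMPLES):
    """Linearly interpolate runs of None that are at most max_gap samples long.

    Gaps at the start or end of the trial (no valid neighbor on one side)
    and gaps longer than max_gap are left as None.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i                      # first valid index after the gap, or n
        gap_len = end - start
        if start > 0 and end < n and gap_len <= max_gap:
            a, b = out[start - 1], out[end]
            for k in range(start, end):
                frac = (k - start + 1) / (gap_len + 1)
                out[k] = a + frac * (b - a)
    return out


def keep_trial(values, max_missing_fraction=MAX_MISSING_FRACTION):
    """Return True if the fraction of missing samples is within the limit."""
    missing = sum(v is None for v in values)
    return missing / len(values) <= max_missing_fraction


x = [0.0, None, None, 3.0, None, None, None, None, None, None, None, 4.0]
print(fill_short_gaps(x), keep_trial(x))
```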
To suppress high-frequency jitter while retaining task-relevant trends, gaze position was smoothed using a Savitzky–Golay filter (window length: five samples; polynomial order 2). Velocity was then estimated by a central finite difference on the smoothed position signals. For the pursuit task, target velocity was computed analytically from the known trajectory and sampling times. All event-related metrics were interpreted at a coarse temporal resolution consistent with 30 Hz sampling.
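The smoothing and differentiation steps can be sketched without external libraries, because the five-point, second-order Savitzky–Golay smoother has fixed convolution weights of (−3, 12, 17, 12, −3)/35. The sketch assumes a gap-free input channel and, as a simplification, leaves the first and last two samples unsmoothed; library implementations typically treat the edges differently.

```python
FS_HZ = 30.0
SG_WEIGHTS = (-3.0, 12.0, 17.0, 12.0, -3.0)   # divided by 35 below


def savgol5(x):
    """Five-point quadratic Savitzky-Golay smoothing of a 1-D sequence.

    Edge samples (first/last two) are copied unchanged in this sketch.
    """
    y = list(x)
    for i in range(2, len(x) - 2):
        y[i] = sum(w * x[i + k - 2] for k, w in enumerate(SG_WEIGHTS)) / 35.0
    return y


def central_velocity(x, fs=FS_HZ):
    """Central finite-difference velocity in units/s; endpoints are None."""
    v = [None] * len(x)
    for i in range(1, len(x) - 1):
        v[i] = (x[i + 1] - x[i - 1]) * fs / 2.0
    return v


position_deg = [0.0, 0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6]
print(central_velocity(savgol5(position_deg)))
```

Because the central difference spans two sample intervals (about 67 ms at 30 Hz), the resulting velocity estimate is inherently coarse, which is consistent with interpreting event metrics at a coarse temporal resolution.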
More detailed parameters are provided in Table 3.
1.5.2. Feature Extraction
Raw gaze samples were recorded as two-dimensional coordinate sequences (x, y) and segmented into trials. For each trial, we computed a fixed-length feature vector summarizing gaze stability, saccadic strategy, and pursuit fidelity. Unless otherwise specified, angular quantities are expressed in degrees of visual angle.
Task 1 (visual enumeration) features characterized free-viewing behavior. Fixations were segmented using an I-DT algorithm (dispersion threshold 1.0°; minimum duration 100 ms). Fixation-duration statistics (mean and standard deviation), fixation dispersion (within-fixation RMS), and the number of refixations were computed per trial. Saccade amplitude was defined as the Euclidean distance between successive fixation centroids. Regression counts were computed as the number of saccades directed opposite to the subject’s predominant horizontal scanning direction within the array. Scan-path entropy quantified the unpredictability of the fixation-to-fixation transition sequence: fixation centroids were mapped to a uniform 8 × 6 grid over the display, and Shannon entropy was computed over the resulting state-transition sequence.
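A dispersion-threshold (I-DT) segmentation with the stated parameters can be sketched as below. The text does not define dispersion, so the sketch assumes the common (max x − min x) + (max y − min y) form; it also assumes that 100 ms corresponds to three samples at 30 Hz and that the input contains no missing samples.

```python
FS_HZ = 30.0
DISPERSION_THRESHOLD_DEG = 1.0
MIN_DURATION_S = 0.100


def dispersion(window):
    """Assumed dispersion definition: x-range plus y-range of the window."""
    xs = [p[0] for p in window]
    ys = [p[1] for p in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_fixations(points, fs=FS_HZ, threshold=DISPERSION_THRESHOLD_DEG,
                  min_duration=MIN_DURATION_S):
    """Return fixations as (start_idx, end_idx_exclusive, centroid_x, centroid_y)."""
    min_len = max(2, round(min_duration * fs))    # 3 samples at 30 Hz
    fixations = []
    i, n = 0, len(points)
    while i + min_len <= n:
        j = i + min_len
        if dispersion(points[i:j]) <= threshold:
            # Grow the window until adding one more sample breaks the threshold.
            while j < n and dispersion(points[i:j + 1]) <= threshold:
                j += 1
            xs = [p[0] for p in points[i:j]]
            ys = [p[1] for p in points[i:j]]
            fixations.append((i, j, sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j
        else:
            i += 1
    return fixations


gaze = [(0.0, 0.0)] * 5 + [(5.0, 5.0)] * 5
print(idt_fixations(gaze))
```

Saccade amplitude as defined above then follows as the Euclidean distance between consecutive fixation centroids returned by this function.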
Task 2 (grating viewing) features characterized fixation stability and orientation-dependent drift during stimulus viewing. Trials were analyzed within a circular region of interest (ROI) centered on the stimulus (radius 2.0°). Fixation latency was defined as the time from stimulus onset to the first fixation whose centroid fell inside the ROI. Directional drift was computed as the net gaze displacement projected onto the stimulus orientation axis within the trial. Fixation dispersion was quantified as the area of the convex hull of gaze samples within the ROI, providing a robust measure of spatial spread over the 2 s viewing period.
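The convex-hull dispersion measure can be illustrated with a standard monotone-chain hull followed by the shoelace formula. The choice of hull algorithm and the helper names are assumptions of this sketch; samples are (x, y) tuples in degrees relative to the stimulus center.

```python
import math

ROI_RADIUS_DEG = 2.0


def in_roi(points, center=(0.0, 0.0), radius=ROI_RADIUS_DEG):
    """Keep only gaze samples inside the circular region of interest."""
    return [p for p in points if math.dist(p, center) <= radius]


def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area (deg^2) of the convex hull via the shoelace formula."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0


samples = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5), (3.0, 3.0)]
print(hull_area(in_roi(samples)))
```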
To quantify small corrective behavior at a resolution compatible with 30 Hz, we defined small corrective saccade-like events using velocity thresholding on the smoothed gaze signal. A saccade-like event was detected when instantaneous speed exceeded 60°/s for at least one sample; event amplitude was defined as the position change between the samples immediately before and after the threshold crossing. Events with an amplitude between 0.3° and 1.0° were counted as small corrective saccade-like events. The resulting count was normalized by trial duration and reported as a rate (min⁻¹). This metric is interpreted as a coarse index of small corrective behavior rather than a precise estimate of canonical microsaccades.
Task 3 (sinusoidal pursuit) features were computed by aligning gaze samples to the known target trajectory frame by frame. Tracking error was computed as the mean Euclidean distance between gaze position and target center over the trial. Pursuit gain was computed as the median ratio of gaze speed to target speed for samples where the target speed exceeded 2°/s, reducing instability near velocity reversals. Catch-up saccades were defined as saccade-like events with an amplitude greater than 1.5° that reduced instantaneous position error. Pursuit onset latency was defined as the first time after motion onset at which the gaze velocity exceeded 5°/s and remained directionally consistent with the target for at least three consecutive samples.
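Two of the pursuit descriptors above, tracking error and pursuit gain, can be sketched as follows. The sketch assumes gaze and target samples are already time-aligned, gap-free, and expressed in degrees (positions) and degrees per second (speeds); the function names are illustrative.

```python
import math
import statistics

MIN_TARGET_SPEED_DEG_S = 2.0


def tracking_error(gaze_xy, target_xy):
    """Mean Euclidean gaze-to-target distance (deg) over aligned samples."""
    return statistics.fmean([math.dist(g, t) for g, t in zip(gaze_xy, target_xy)])


def pursuit_gain(gaze_speed, target_speed, min_target_speed=MIN_TARGET_SPEED_DEG_S):
    """Median gaze/target speed ratio over samples where the target is fast enough.

    Returns None when no sample exceeds the target-speed criterion.
    """
    ratios = [g / t for g, t in zip(gaze_speed, target_speed) if t > min_target_speed]
    return statistics.median(ratios) if ratios else None


gaze = [(0.0, 0.0), (3.0, 4.0)]
target = [(0.0, 0.0), (0.0, 0.0)]
print(tracking_error(gaze, target))
print(pursuit_gain([0.5, 4.0, 9.0, 8.0], [1.0, 5.0, 10.0, 10.0]))
```

Taking the median rather than the mean of the speed ratios limits the influence of samples contaminated by catch-up saccades, in which gaze speed briefly far exceeds target speed.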
Trial-level feature vectors extracted from the three tasks were concatenated into a single representation for classification. Feature dimensionality was kept modest (dozens of features) to match the cohort size and to facilitate interpretability.
1.5.3. Model Development and Evaluation
Trial-level feature vectors extracted from the three tasks were concatenated into a single representation, $\mathbf{x}_{i,t}$, where $i$ indexes subjects and $t$ indexes trials. Within each cross-validation split, features were standardized by z-scoring using normalization parameters estimated only from training subjects and then applied unchanged to validation/test subjects. Missing feature values (e.g., due to trial exclusion after quality control) were imputed using feature-wise medians learned only from the training data, with the learned mapping applied consistently to held-out subjects to avoid information leakage.
The primary classifier was a Random Forest composed of $M$ decision trees trained on bootstrap-resampled trials from the training subjects. Splits were selected by minimizing Gini impurity, and candidate features were randomly subsampled at each node to encourage diversity among trees. Class weights were set inversely proportional to class frequencies within each training set. For subject $i$ and trial $t$, the forest produced a trial-level posterior probability by averaging the tree-level posteriors:
$$p_{i,t} = \frac{1}{M} \sum_{m=1}^{M} p_m\left(y = 1 \mid \mathbf{x}_{i,t}\right),$$
where $p_m(\cdot)$ denotes the probabilistic output of the $m$-th tree.
Because Random Forest probabilities may be miscalibrated, trial-level probabilities were calibrated using isotonic regression learned within the training data. Denoting the calibration map by $g(\cdot)$, the calibrated trial probability is
$$\tilde{p}_{i,t} = g\left(p_{i,t}\right).$$
Importantly, $g$ was fitted exclusively in the inner cross-validation loop (described below) using only training subjects and then applied unchanged to the held-out subject in the outer test fold.
Subject-level evidence was obtained by robustly aggregating calibrated trial probabilities for each subject. The primary analysis used the median across all valid trials available for that subject:
$$P_i = \operatorname{median}_{t \in \mathcal{T}_i} \, \tilde{p}_{i,t},$$
where $\mathcal{T}_i$ denotes the set of valid trials for subject $i$. A binary decision for subject $i$ was then produced by thresholding the aggregated probability:
$$\hat{y}_i = \mathbb{1}\left[P_i \geq \tau\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function and $\tau$ is the operating threshold. The threshold $\tau$ was selected exclusively within the inner training data by maximizing Youden’s $J$ statistic ($J = \text{sensitivity} + \text{specificity} - 1$) and was then applied unchanged to the held-out subject in the outer test fold.
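The subject-level aggregation and threshold selection described above can be sketched as follows. The sketch assumes that candidate thresholds are taken from the observed training-subject scores, that a score equal to the threshold counts as positive, and that both classes are present among the training subjects; none of these details is specified in the text, and the function names are illustrative.

```python
import statistics


def subject_score(calibrated_trial_probs):
    """Aggregate a subject's calibrated trial probabilities by the median."""
    return statistics.median(calibrated_trial_probs)


def youden_threshold(scores, labels):
    """Select the threshold maximizing J = sensitivity + specificity - 1.

    scores: subject-level probabilities of training subjects.
    labels: 1 for amblyopia, 0 for control (both classes must be present).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tau, best_j = None, -1.0
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_tau, best_j = tau, j
    return best_tau, best_j


def classify(score, tau):
    """Binary subject-level decision at threshold tau."""
    return int(score >= tau)


train_scores = [subject_score(p) for p in ([0.1, 0.2, 0.05], [0.3, 0.2, 0.1],
                                           [0.6, 0.7, 0.9], [0.9, 0.95, 0.8])]
tau, j = youden_threshold(train_scores, [0, 0, 1, 1])
print(tau, j, classify(0.75, tau))
```

In the nested protocol, `youden_threshold` would be called only on inner-loop training subjects, and the resulting threshold would then be applied unchanged to the held-out subject.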
Model selection and performance evaluation followed a nested, subject-wise protocol to ensure that all learning steps were confined to training subjects. Specifically, an outer leave-one-subject-out loop was used for testing, and within each outer training set, inner group-wise cross-validation over the remaining subjects was used to tune hyperparameters, fit the isotonic calibration map, and select the operating threshold. All preprocessing steps (standardization and imputation), calibration, and threshold selection were performed strictly within the training folds and then transferred to the corresponding held-out data. Performance was reported at the subject level using discrimination metrics (area under the ROC curve, AUC; and average precision, AP) and operating-point metrics (sensitivity, specificity, and accuracy). Confidence intervals for sensitivity and specificity were computed using binomial intervals (Wilson method), and confidence intervals for AUC/AP were computed by subject-level bootstrap resampling to respect the grouped structure of the data.
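For reference, the Wilson score interval used for sensitivity and specificity can be computed in closed form. This is a minimal sketch; the function name and the default critical value (two-sided 95%) are assumptions of the example.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n
                               + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Example: an interval for a sensitivity of 31 correct out of 35 positives.
low, high = wilson_interval(31, 35)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for proportions near 1, which matters with only 35 subjects per group.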
To contextualize the proposed model, baseline classifiers were trained on the same gaze feature set under the same nested, subject-wise evaluation protocol, including regularized logistic regression and a linear support vector machine; additional non-linear baselines (e.g., gradient-boosted decision trees) were evaluated using the same preprocessing, calibration, and threshold-selection procedures to enable fair comparison. In addition, to quantify the incremental value of gaze dynamics beyond routine clinical measures, we constructed (i) a clinical-only baseline and (ii) a combined clinical + gaze model using the same subject-wise evaluation protocol. The clinical feature set included best-corrected visual acuity (logMAR) in each eye, interocular acuity difference (IOD, logMAR), stereoacuity (arcsec; log-transformed), and refractive error summarized as spherical equivalent (SE, diopters) for each eye and anisometropia (absolute interocular SE difference). For the clinical-only baseline, a regularized logistic regression classifier was trained on subject-level clinical vectors. For the combined model, clinical features were concatenated with subject-level aggregated gaze features (median across retained trials per task, then concatenated across tasks), with all preprocessing and model selection performed within the training folds only.
1.6. Statistical Analysis
Descriptive group comparisons of demographic and clinical variables (Table 1) used independent-samples t-tests for approximately normally distributed continuous variables, Mann–Whitney U tests for non-normal variables (stereoacuity), and chi-square tests for categorical variables (sex).
For feature-level group comparisons (when reported), trial-level features were first aggregated to the subject level (median across valid trials within each task; Task 2 features were additionally summarized across orientations). Between-group differences were assessed using independent-samples t-tests or Mann–Whitney U tests depending on normality (Shapiro–Wilk test). Effect sizes were summarized using Cohen’s d (t-tests) or rank-biserial correlation (Mann–Whitney U). When multiple features were tested, the false-discovery rate (FDR) was controlled using the Benjamini–Hochberg procedure (q = 0.05).
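The Benjamini–Hochberg step-up procedure can be sketched in a few lines. The function name is illustrative, and the example p-values are made up for demonstration; they are not results from this study.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k with p_(k) <= q * k / m and reject
    the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values for one task's feature family.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

Because the rule is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value further down the ordering meets its threshold.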
Order effects associated with the fixed task sequence were assessed using data quality and oculomotor summaries. Within each task, we compared early versus late trials (first half vs. second half) for the fraction of invalid samples and trial exclusion rates.
1.7. Participant Characteristics and Data Quality
All 70 enrolled participants completed the three-task protocol. Demographic and clinical characteristics are reported in Table 1. Groups were age-matched (control: 8.7 ± 1.6 years; amblyopia: 9.1 ± 1.7 years) with a similar sex distribution. As expected, the amblyopia group exhibited reduced best-corrected visual acuity in the amblyopic eye and markedly impaired stereoacuity.
Eye-tracking data quality was high across tasks under natural head posture. After blink/off-screen removal and validity screening, the median retained trials per participant were 7/8 for Task 1, 69/72 for Task 2, and 6/6 for Task 3, with no meaningful between-group difference (Table 4). Across retained trials, the mean fraction of invalid samples remained below 8% for all tasks in both groups (Table 4). Based on the retained-trial counts, the corresponding mean trial rejection rates were low overall (approximately 0–11% across tasks) and were similar between groups.
To evaluate potential order-related confounds (Task 1 → Task 2 → Task 3), we compared early vs. late trials within each task (see Table 5). Invalid-sample fractions increased slightly over time in both groups (typically <2 percentage points), and the magnitude of change did not differ meaningfully between groups, suggesting minimal fatigue- or learning-related bias in the data. Together, these results suggest that the fixed task order is unlikely to materially influence the quality-controlled data used for downstream analyses.
1.8. Task-Evoked Oculomotor Differences Between Groups
Across all three tasks, children with unilateral amblyopia exhibited systematic differences in gaze stability and oculomotor control under best-corrected binocular viewing.
Figure 3 summarizes representative feature distributions, and
Table 6,
Table 7 and
Table 8 report subject-level summary statistics and standardized effect sizes. Group comparisons were performed at the subject level using two-sided tests, with false-discovery rate (FDR) control applied within each task’s feature family.
In the visual search/enumeration task (Task 1), the amblyopia group showed less efficient scanning behavior and reduced stability, including higher scan-path entropy, more regressions, and larger inter-fixation step sizes (Table 6; Figure 3a–d; all FDR-adjusted p < 0.001).
In the grating-viewing task (Task 2), amblyopic participants showed delayed stabilization after stimulus onset and increased orientation-referenced drift and dispersion (Table 7; all FDR-adjusted p < 0.001). Orientation-resolved curves further demonstrated larger deviations for oblique orientations (Figure 3e–h), consistent with orientation-dependent processing differences.
In smooth pursuit (Task 3), the amblyopia group exhibited larger position error, reduced pursuit gain, and more frequent catch-up saccades (Table 8; all FDR-adjusted p < 0.001), indicating impaired eye–target coordination under binocular viewing (Figure 3i–l).
1.9. Subject-Level Classification Performance and Probability Calibration
Using the full feature set and the nested, subject-wise evaluation protocol, the proposed Random Forest achieved strong discrimination between unilateral amblyopia and controls at the subject level (AUC = 0.966, 95% CI: 0.92–0.99; AP = 0.972, 95% CI: 0.90–0.99; Figure 4a,b; Table 9). Isotonic calibration improved probability reliability for the aggregated subject probabilities (Figure 4c), yielding an expected calibration error of 0.04 and a Brier score of 0.095.
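For readers less familiar with the two calibration summaries, the following sketch shows one common way to compute them from subject-level probabilities. The equal-width binning and the choice of 10 bins are assumptions (the text does not state the binning scheme), and the example inputs are hypothetical rather than study data.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between mean predicted probability and observed rate.

    Assumption: equal-width bins on [0, 1]; n_bins=10 is illustrative only.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_conf - frac_pos)
    return ece


# Hypothetical subject-level probabilities and labels (1 = amblyopia).
example_probs = [0.8, 0.8]
example_labels = [1, 0]
example_brier = brier_score(example_probs, example_labels)
example_ece = expected_calibration_error(example_probs, example_labels)
```

With small cohorts the ECE estimate is sensitive to the number of bins, so it is usually read alongside the reliability curve rather than in isolation.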
At the operating threshold selected within the inner loop (maximizing Youden’s J), outer-loop held-out predictions yielded TP = 31, FN = 4, TN = 33, and FP = 2 (Figure 4d), corresponding to sensitivity = 0.886 and specificity = 0.943 (Table 9). Aggregated predicted probabilities showed clear separation between groups, with most controls falling below the decision threshold and most amblyopia cases above it (Figure 4e), supporting stable subject-level decisions at the chosen operating point. Learning-curve analysis further indicated that AUC increased with training-set size and showed diminishing returns near the full cohort (Figure 4f).
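The operating-point metrics follow directly from the reported confusion-matrix counts. The short sketch below reproduces that arithmetic; the function name and dictionary keys are illustrative, and Youden's J is computed here on the held-out counts only to show the definition (in the study it was maximized within the inner loop).

```python
def operating_point_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and Youden's J from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true-positive rate among amblyopia cases
    specificity = tn / (tn + fp)   # true-negative rate among controls
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_j": sensitivity + specificity - 1.0,
    }


# Counts reported for the outer-loop held-out predictions.
metrics = operating_point_metrics(tp=31, fn=4, tn=33, fp=2)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```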
1.10. Baseline Comparisons, Ablations, and Interpretability
To contextualize performance, we trained baseline classifiers on the identical feature set under the same nested, subject-wise protocol. Regularized logistic regression and linear SVM achieved lower AUC and accuracy than the Random Forest, while gradient-boosted trees were comparable but did not exceed the primary model (Table 10).
Task-ablation analyses supported complementary contributions of the three tasks. Single-task models were less accurate than models combining tasks, with the full three-task model achieving the highest AUC under the same evaluation protocol (Figure 4g, Table 11).
Model interpretability analyses were consistent with task-level effects. Permutation importance highlighted that the most informative predictors were predominantly derived from the grating-viewing and smooth-pursuit tasks, followed by search-related measures (Figure 4h), suggesting that the classifier leveraged distributed evidence across multiple oculomotor domains rather than a single cue.
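The idea behind permutation importance can be sketched in a few lines: shuffle one feature column at a time and record how much a performance score drops. The version below is a generic illustration, not the study's implementation; it assumes accuracy as the score (the text does not state which metric was permuted against), and the toy predictor and data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the predictor matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 drives the label, feature 1 is ignored.
toy_rows = [[0.1, 5.0], [0.2, 3.0], [0.9, 4.0], [0.8, 1.0]]
toy_labels = [0, 0, 1, 1]


def toy_predict(row):
    """Stand-in for a fitted classifier; thresholds feature 0 only."""
    return int(row[0] > 0.5)


toy_importances = permutation_importance(toy_predict, toy_rows, toy_labels)
```

A feature the model never uses receives an importance of exactly zero, which is why the method is convenient for checking whether evidence is spread across tasks or concentrated in a single cue.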
1.11. Robustness and Sensitivity Analyses
Because recordings were acquired at 30 Hz under natural head posture, we evaluated robustness to reasonable variations in preprocessing and event definitions. Across a grid of fixation-segmentation thresholds (dispersion threshold: 0.75°/1.0°/1.25°; minimum duration: 80/100/120 ms) and two derivative pipelines (Savitzky–Golay vs. low-pass filtering prior to finite-difference velocity), subject-level AUC varied within a narrow range (0.960–0.968) and the top-ranked feature families remained unchanged (Table 12).
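To make the two segmentation parameters concrete, the sketch below implements a basic dispersion-threshold fixation detector. It is an assumption-laden illustration: the dispersion measure (horizontal range plus vertical range), the sample-count definition of duration, and all names are ours, not the study's; only the 30 Hz rate and the threshold grid come from the text.

```python
import math


def detect_fixations(x, y, sample_rate_hz=30.0, dispersion_deg=1.0,
                     min_duration_ms=100.0):
    """Return (start, end) sample index pairs, end exclusive, for fixations.

    Assumptions: gaze is in degrees; dispersion is (max x - min x) +
    (max y - min y); duration is counted as samples times sample period.
    """
    # 100 ms at 30 Hz corresponds to 3 samples under this convention.
    min_samples = max(2, math.ceil(min_duration_ms * sample_rate_hz / 1000.0))

    def dispersion(lo, hi):
        xs, ys = x[lo:hi], y[lo:hi]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    start, n = 0, len(x)
    while start + min_samples <= n:
        end = start + min_samples
        if dispersion(start, end) <= dispersion_deg:
            # Grow the window while the gaze stays within the threshold.
            while end < n and dispersion(start, end + 1) <= dispersion_deg:
                end += 1
            fixations.append((start, end))
            start = end
        else:
            start += 1
    return fixations


# Hypothetical trace: six samples at one location, then six at another.
trace_x = [0.0] * 6 + [5.0] * 6
trace_y = [0.0] * 12

# Re-run over the threshold grid described in the text.
grid_results = {
    (disp, dur): detect_fixations(trace_x, trace_y, 30.0, disp, dur)
    for disp in (0.75, 1.0, 1.25)
    for dur in (80.0, 100.0, 120.0)
}
```

At 30 Hz the minimum-duration settings of 80, 100, and 120 ms map to only 3, 3, and 4 samples under this convention, which is one reason a coarse grid is enough to probe sensitivity.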
We also examined dependence on the modeled gaze stream. Repeating the full pipeline using (i) cyclopean gaze (primary), (ii) amblyopic-eye-only gaze, and (iii) fellow-eye-only gaze demonstrated that the cyclopean and amblyopic-eye-only streams yielded similar performance, whereas the fellow-eye-only stream was lower but remained above chance (Table 13). This pattern suggests that amblyopia-related oculomotor signatures are preserved under binocular viewing and that the amblyopic-eye stream carries substantial discriminative information.
As a further comparison, a baseline using clinical variables alone achieved moderate discrimination, whereas the gaze-based model performed substantially better. Combining clinical variables with gaze features yielded a small additional gain, indicating that gaze dynamics contribute complementary information beyond standard clinical measures (Table 14).
1.12. Decision-Analytic Implications Under Real-World Prevalence
Because the study used a balanced case–control design (50/50), positive and negative predictive values will differ in community screening, where prevalence is lower. Using the operating-point sensitivity and specificity estimated above, we computed expected PPV and NPV across plausible prevalence values (Table 15). At 5% prevalence, the expected PPV was approximately 0.45, while NPV exceeded 0.99, indicating that the tool is well suited to rule out amblyopia and to triage children for confirmatory clinical assessment.
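The prevalence adjustment is an application of Bayes' rule to the reported operating point. The sketch below reproduces the calculation; the prevalence grid is illustrative and is not necessarily the set of values used in Table 15, and the function name is ours.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Expected PPV and NPV for a given prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Operating point from the reported counts (TP = 31, FN = 4, TN = 33, FP = 2).
sens = 31 / 35
spec = 33 / 35

# Illustrative prevalence grid (an assumption, not taken from Table 15).
prevalence_grid = (0.01, 0.02, 0.05, 0.10)
pv_table = {prev: predictive_values(sens, spec, prev) for prev in prevalence_grid}

for prev, (ppv, npv) in pv_table.items():
    print(f"prevalence={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

At 5% prevalence this gives a PPV of about 0.45 and an NPV above 0.99, matching the values quoted in the text; PPV falls further as prevalence decreases, which is why confirmatory assessment of positives remains necessary.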
2. Discussion
This study demonstrates that a brief, task-evoked oculomotor battery yields robust and interpretable behavioral signatures of unilateral amblyopia in school-aged children. Across the three tasks, amblyopic participants showed consistent impairments in gaze stability, saccadic organization, and pursuit fidelity. When these physiologically motivated features were combined in a calibrated Random Forest model, we obtained strong subject-level discrimination and well-calibrated probabilities, supporting the feasibility of task-evoked eye movements as functional screening biomarkers.
Several aspects of the pipeline address common methodological concerns for pediatric wearable eye tracking. All modeling steps were performed within a nested, subject-wise protocol to avoid leakage, with calibration and threshold selection confined to the training data within each split. Data quality remained high across tasks, and although the task order was fixed, order-related changes in the invalid-sample fraction were small and similar across groups, arguing against major fatigue-driven confounding under this short protocol. Importantly, because recordings were acquired at 30 Hz under natural head posture, we evaluated robustness to reasonable variations in preprocessing and event definitions; performance remained stable across these choices. Consistent with this sampling rate and the prototype context, our “small corrective saccade-like event” measure should be interpreted as a coarse index of corrective behavior rather than a canonical microsaccade metric.
Binocular viewing is ecologically appropriate for screening but could in principle attenuate unilateral signatures. We therefore assessed dependence on the modeled gaze stream. Using cyclopean gaze as the primary signal and repeating the pipeline with amblyopic-eye-only and fellow-eye-only signals, we found comparable performance for cyclopean and amblyopic-eye-only inputs, whereas fellow-eye-only performance was lower. This suggests that amblyopia-related oculomotor signatures remain detectable under binocular viewing and that the amblyopic-eye stream carries substantial discriminative information.
The multi-task design also appears to capture complementary information. Task ablation showed that combining tasks improved performance over single-task models, and feature-importance analyses highlighted contributions from pursuit- and grating-derived descriptors, while search-related measures provided additional evidence. For screening applications, such complementarity is valuable because children’s engagement and data quality can vary across tasks and individuals.
With respect to comparators, we benchmarked the proposed approach against standard classical baselines trained under the same nested protocol (regularized logistic regression, linear SVM, and gradient-boosted trees). Comparisons with deep-learning models (e.g., CNNs/transformers) are of interest, but fair evaluation of higher-capacity models typically requires larger datasets to control variance and to support leakage-free hyperparameter selection; thus, deep-learning benchmarking is best pursued in larger multi-center cohorts.
A central translational consideration is that predictive values depend on prevalence. While the present case–control design is balanced, we reported expected PPV/NPV across plausible community prevalence ranges, emphasizing the intended use as a screening/triage tool to refer children for confirmatory clinical assessment rather than as a stand-alone diagnosis. Relatedly, this study did not include a head-to-head comparison against chart-based acuity screening or commercial instrument-based photoscreeners (e.g., Spot Vision Screener) because these modalities were not collected in parallel under matched referral criteria. Prospective validation should therefore include guideline-consistent head-to-head comparisons and external testing on independent cohorts.