Peer-Review Record

Integrating Multi-Task Eye Tracking and Interpretable Machine Learning for High-Accuracy Screening of Amblyopia in Pediatric Populations

J. Eye Mov. Res. 2026, 19(2), 26; https://doi.org/10.3390/jemr19020026

by Xiumei Song¹, Yunhan Zhang², Hongyu Chen¹, Chenyu Tang³, Bohan Yao⁴, Hubin Zhao⁵, Luigi G. Occhipinti³, Arokia Nathan⁶, Changbin Zhai¹,* and Shuo Gao²,*

Reviewer 1: Stefanos Balaskas

Reviewer 2: Anonymous

Submission received: 4 December 2025 / Revised: 15 January 2026 / Accepted: 2 February 2026 / Published: 2 March 2026

(This article belongs to the Special Issue Digital Advances in Binocular Vision and Eye Movement Assessment)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

To screen for unilateral amblyopia versus controls (N=70; 35/35), the manuscript proposes a 30-minute, child-friendly, multi-task eye-tracking battery (visual search, orientation-dependent fixation, and sinusoidal smooth pursuit) together with an interpretable Random Forest classifier that uses nested, subject-wise validation and isotonic probability calibration. Pursuit and orientation-dependent features emerge as the most informative, and the reported subject-level performance is high (AUC≈0.97; accuracy≈0.91 at a selected operating point).

1) "Wearable binocular gaze at 30 Hz": make clear which signal is being modeled (left eye, right eye, or both) and what this implies for unilateral amblyopia.

State clearly whether features are computed from the left eye only, the right eye only, the cyclopean (averaged) signal, or the better eye, and how missing eye samples and blinks are handled.

If you merge the two eyes, explain why you expect the combined signal to preserve amblyopia signatures under binocular viewing.

Consider adding a sensitivity analysis that compares features from the amblyopic-eye stream with those from the fellow-eye stream (if available), or binocular with monocular viewing (if available or obtainable in a small sample).

2) The trial structure is not clear enough (risk of hidden leakage and an unclear unit of analysis).

For each task, report how many trials each subject completed, how many remained usable after cleaning, and whether trial counts differ between groups.

State explicitly that the inner CV folds are grouped by subject, not by trial, for both isotonic calibration and threshold selection.

Report sensitivity and specificity with confidence intervals, not just AUC and AP. With N=70, binomial CIs are important for assessing the claims.
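As an illustration of the binomial CIs requested above, the following minimal sketch computes a Wilson score interval for a proportion. The counts are hypothetical placeholders (the manuscript's confusion matrix is not reproduced in this record), and the Wilson method is only one reasonable choice; an exact Clopper-Pearson interval would serve equally well.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts, NOT taken from the manuscript:
# 32 of 35 amblyopic children flagged, 31 of 35 controls cleared.
sens_lo, sens_hi = wilson_interval(32, 35)
spec_lo, spec_hi = wilson_interval(31, 35)
print(f"sensitivity 32/35: 95% CI [{sens_lo:.3f}, {sens_hi:.3f}]")
print(f"specificity 31/35: 95% CI [{spec_lo:.3f}, {spec_hi:.3f}]")
```

With 35 subjects per group, even a point estimate above 0.90 leaves an interval roughly 0.2 wide (about 0.78 to 0.97 for the hypothetical 32/35), which is why the intervals matter for checking the headline claims.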

3) Eye-movement event detection at 30 Hz: justify the parameter choices and show that the measurements are reliable.

Add a brief validation/robustness section showing that the key effects persist when thresholds are varied, for example the I-DT dispersion threshold, the minimum duration, or the pursuit smoothing/derivative method.

Describe the filtering/smoothing and derivative estimation (Savitzky–Golay? finite differences? low-pass?) and how noise and head movement are handled.

Throughout, use "small corrective saccade-like events" instead of "small saccade rate," and avoid implying that microsaccades can be validly measured.
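The robustness check requested in this comment can be sketched as a small parameter sweep over a dispersion-threshold (I-DT) fixation detector. The implementation below is a generic textbook-style I-DT, not the authors' pipeline; the threshold values, the 0.1 s minimum duration, and the synthetic gaze trace are all assumptions chosen for illustration.

```python
def idt_fixations(x, y, fs_hz=30.0, max_dispersion_deg=1.0, min_duration_s=0.1):
    """Dispersion-threshold (I-DT) fixation detection.

    x, y are gaze positions in degrees sampled at fs_hz.
    Returns a list of (start, end) sample indices, end exclusive.
    """
    min_len = max(2, round(min_duration_s * fs_hz))

    def dispersion(i, j):
        # (max - min) in x plus (max - min) in y over samples i..j-1
        return (max(x[i:j]) - min(x[i:j])) + (max(y[i:j]) - min(y[i:j]))

    fixations = []
    start, n = 0, len(x)
    while start + min_len <= n:
        end = start + min_len
        if dispersion(start, end) <= max_dispersion_deg:
            # Grow the window until adding one more sample breaks the threshold.
            while end < n and dispersion(start, end + 1) <= max_dispersion_deg:
                end += 1
            fixations.append((start, end))
            start = end
        else:
            start += 1
    return fixations


# Synthetic trace (assumption): 10 samples at (0, 0) deg, then 10 at (5, 5) deg.
gaze_x = [0.0] * 10 + [5.0] * 10
gaze_y = [0.0] * 10 + [5.0] * 10

# Sweep the dispersion threshold and watch whether the event count is stable.
sweep = {thr: len(idt_fixations(gaze_x, gaze_y, max_dispersion_deg=thr))
         for thr in (0.5, 1.0, 1.5)}
print(sweep)
```

On real recordings, the same sweep would be repeated for the minimum duration and the smoothing settings, with the group differences in the downstream features reported for each configuration.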

4) Fixed task order (Task1→Task2→Task3): possible fatigue/learning confound

You briefly justify the fixed order, but you should also report whether later tasks show more data loss, more blinks, or lower engagement.

If possible, add an order-effect check (even post hoc), for example by correlating trial index or elapsed time with key features, or by comparing early and late trials within tasks.
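One simple form of the post hoc order-effect check suggested here is a rank correlation between trial index and a per-trial quality or feature value. The sketch below implements Spearman's rho with average ranks for ties; the blink counts are invented for illustration, and the choice of Spearman over other trend tests is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical per-trial blink counts for one subject (not study data).
trial_index = [1, 2, 3, 4, 5, 6, 7, 8]
blink_count = [1, 1, 2, 2, 3, 3, 4, 5]
print(f"rho(trial index, blinks) = {spearman(trial_index, blink_count):.2f}")
```

A clearly positive rho across subjects for blinks or data loss would point to fatigue building up over the session, which the fixed task order cannot separate from task effects.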

5) There are no baselines or clinical comparators.

Add baselines such as logistic regression (or a linear SVM) on the same features and/or a simpler model for each task family.

If basic clinical covariates are available (BCVA, interocular difference, refraction, stereoacuity), consider:

a clinical-only baseline,

and the added value of “clinical + eye-tracking” (even if only for research purposes).

Add a decision-analytic framing: at a minimum, discuss how PPV/NPV would change outside a balanced 50/50 sample, at a realistic prevalence.
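The dependence of PPV/NPV on prevalence follows directly from Bayes' rule, as the short sketch below shows. The sensitivity and specificity of 0.90 and the prevalence values are illustrative assumptions, not figures from the manuscript.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Assumed operating point (illustrative only).
SENS, SPEC = 0.90, 0.90

for prev in (0.50, 0.10, 0.02):
    ppv, npv = predictive_values(SENS, SPEC, prev)
    print(f"prevalence {prev:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

At a balanced 50/50 split, PPV equals 0.90 under these assumptions, but at a prevalence of 2% it falls to roughly 0.16 while NPV rises above 0.99. This is the gap between case-control performance and screening performance that the comment asks the authors to discuss.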

6) The reporting of statistical tests and multiple comparisons is unclear.

Add a short "group comparison statistics" paragraph that states the tests used, their assumptions, and the correction method. If the comparisons are purely descriptive, say so.

If you retain inferential claims, consider controlling the FDR across feature tests.
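FDR control across feature tests is commonly done with the Benjamini-Hochberg step-up procedure, sketched below. The p-values are made up for illustration, and Benjamini-Hochberg is named here as one standard option, not as the method the authors used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (same order as p_values) marking which
    hypotheses are rejected while controlling the FDR at level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical per-feature p-values (not from the manuscript).
feature_p = [0.20, 0.001, 0.041, 0.008, 0.039]
print(benjamini_hochberg(feature_p))
```

In this made-up example, two features with unadjusted p < 0.05 (0.039 and 0.041) no longer pass after correction, which illustrates why the correction method should be stated alongside the feature-level comparisons.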

Author Response

Please see the attached file (“Response_to_Reviewer1.docx”) for our detailed point-by-point responses and the corresponding revisions in the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This study designed three tasks for screening children for amblyopia: "animal icon counting," "Gabor patch orientation identification," and "sine wave tracking" (pursuit). Using a machine learning model (Random Forest) to analyze data collected from a wearable eye tracker, the system achieved high classification accuracy (AUC 0.966).

Overall, the use of clinically interpretable, handcrafted features and the use of nested cross-validation to prevent data leakage are considered excellent research methodology. However, hardware limitations (30 Hz) and the absence of a comparative experimental group hinder the study's effectiveness.

1. Choosing Random Forest over deep learning for a small data set (N=70) is reasonable, since it limits overfitting and preserves interpretability. In particular, the strict use of nested, group-aware cross-validation to prevent information leakage between training and testing data, and the use of isotonic regression to calibrate the predicted probabilities, demonstrate a high level of rigor for medical AI research. Nevertheless, comparative experiments against algorithms used in existing studies, such as DNNs, CNNs, and Transformers, are necessary.
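The group-aware splitting praised in this comment can be illustrated with a minimal, dependency-free sketch: folds are assigned per subject, so all trials from one child land on the same side of every split. The function name, fold counts, and trial layout below are hypothetical; in practice a library splitter such as scikit-learn's `GroupKFold` plays this role.

```python
import random


def subject_wise_folds(subject_ids, n_folds=5, seed=0):
    """Return (train_idx, test_idx) pairs in which no subject is on both sides."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for k in range(n_folds):
        test_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        folds.append((train_idx, test_idx))
    return folds


# Hypothetical layout: 10 subjects with 4 trials each (not the study's data).
trial_subjects = [f"S{n:02d}" for n in range(10) for _ in range(4)]

outer = subject_wise_folds(trial_subjects, n_folds=5, seed=0)
for train_idx, test_idx in outer:
    train_subjects = [trial_subjects[i] for i in train_idx]
    # Inner folds use training subjects only, so hyperparameter tuning,
    # calibration, and threshold selection never see held-out subjects.
    inner = subject_wise_folds(train_subjects, n_folds=4, seed=1)
    held_out = sorted({trial_subjects[i] for i in test_idx})
    print(f"held-out subjects: {held_out}, inner folds: {len(inner)}")
```

Splitting by trial instead would let the same child's data appear in both training and test sets, which tends to inflate reported performance.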

2. The authors used an eye tracker with a 30 Hz sampling rate and presented the small-saccade rate as a key feature. Generally, a high-speed eye tracker with a sampling rate of at least 200 Hz is recommended for analyzing microsaccades or small saccades. While the authors acknowledge this in the limitations section, they do not explain how reliably, from a signal processing perspective, movements of less than 1 degree can be quantified from 30 Hz data (approximately 33 ms intervals) and used as key features. This leaves it unclear whether the detected events reflect noise or actual movement.
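The sampling concern can be made concrete with simple arithmetic: the number of samples that fall within an event is its duration multiplied by the sampling rate. The 25 ms event duration below is an assumed, illustrative value for a small saccade, not a figure from the manuscript.

```python
def sample_interval_ms(fs_hz: float) -> float:
    """Time between consecutive samples, in milliseconds."""
    return 1000.0 / fs_hz


def samples_spanned(event_duration_ms: float, fs_hz: float) -> float:
    """Average number of samples that fall within an event of this duration."""
    return event_duration_ms * fs_hz / 1000.0


ASSUMED_EVENT_MS = 25.0  # illustrative duration of a small saccade (assumption)

for fs in (30.0, 200.0):
    print(f"{fs:.0f} Hz: interval = {sample_interval_ms(fs):.1f} ms, "
          f"samples per {ASSUMED_EVENT_MS:.0f} ms event = "
          f"{samples_spanned(ASSUMED_EVENT_MS, fs):.2f}")
```

Under this assumption, an event spans less than one sample at 30 Hz but five samples at 200 Hz, so at 30 Hz its velocity profile cannot be resolved and only a position jump between neighbouring samples is observable.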

3. The device used is described only as a "Custom wearable binocular eye tracker." The specific sensor specifications, open source status, and references are unclear, making it difficult for third parties to replicate the results.

4. The study only mentions "school-aged," and neither the age distribution (mean, standard deviation) nor the gender distribution is clearly reported in the text. Since pediatric eye movements change rapidly with age during development, statistical evidence that the groups are age-matched is needed. The dataset of 70 participants (35 patients, 35 normal subjects) is adequate for a pilot study, but the sample size is insufficient to support general claims of "high-accuracy screening." The lack of an external validation set to demonstrate the generalization performance of the machine learning model is particularly disappointing. Amblyopia can be categorized as strabismic, anisometropic, or mixed. Although eye movement patterns may differ by type (e.g., strabismic amblyopia may present with more severe fixation instability), no subgroup analysis was conducted.

5. This paper only presents the performance of the proposed model (AUC 0.966) and does not provide quantitative comparisons with existing standard testing methods or other studies.

The sensitivity/specificity of the proposed model needs to be compared with the sensitivity/specificity of screening methods based on visual acuity charts used in clinical practice.
To demonstrate clinical utility, a performance comparison with commercially available automated refraction devices (photoscreeners, e.g., Spot Vision Screeners) is required. If this is not possible, please state the reason in the paper.
An ablation study was conducted to compare against single tasks, but there is no comparison with eye-tracking-based algorithms proposed by other researchers (e.g., those based on CNNs, Transformers, or other ML methods such as gradient boosting or XGBoost).

Author Response

Please see the attached file for our detailed point-by-point responses and the corresponding revisions in the manuscript.
