An Open-Source Horizontal Strabismus Simulator as an Evaluation Platform for Monocular Gaze Estimation Using Deep Learning Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Basically, this paper deals with the design and development of a low-budget strabismus simulator as a platform for evaluating monocular gaze estimation systems based on deep learning models. Although the authors state in the title that it is a simulation of vertical and horizontal strabismus, the paper describes only the case of simulation and evaluation of horizontal strabismus. However, as stated in 1. Introduction, horizontal strabismus accounts for 90% of all diagnosed cases of strabismus.
- What is the main question addressed by the research?
The main question and subject of the presented research is the realization of a low-cost strabismus simulator whose mechanical error is well below the smallest reading step accepted in clinical diagnostic procedures (0.1° vs ~0.57°). The authors state that the precision of commercially available eye-trackers is typically 0.5-1.0°, from which it can be concluded that the simulator's mechanical precision would be sufficient for evaluating those devices as well, and not only the targeted appearance-based gaze estimation systems based on deep learning models.
- Do you consider the topic original or relevant to the field? Does it address a specific gap in the field? Please also explain why this is/ is not the case.
This topic is significant, because it provides a high-quality and inexpensive solution for the realization of a strabismus simulation system, transparently presented in an open-source manner and easy to reproduce in less demanding laboratories. In the experimental part of the work, i.e. in the application of the simulator, it is shown how the simulator is applied and what its significance is in the further development of existing CNN models for gaze estimation.
- What does it add to the subject area compared with other published material?
The developed open-source simulator represents an important technical foundation for evaluating monocular gaze in strabismus patients. The mechanical accuracy achieved at the 0.1° level significantly surpasses the minimum detection unit (1 prism diopter ≈ 0.57°) used in clinical diagnosis, making the platform sufficiently reliable. While existing evaluation systems assume normal binocular vision, this system is unique in its ability to reproduce the nonconjugate eye movements specific to strabismus. In addition, the low cost (approximately 200 USD) and complete open-source implementation help create an environment in which research institutions with limited budgets can participate in strabismus research.
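The equivalence "1 prism diopter ≈ 0.57°" quoted above follows from the standard definition of the prism diopter (a deviation of 1 cm at a distance of 1 m). A minimal sketch of the conversion, with hypothetical function names chosen here for illustration, could look like this:

```python
import math


def prism_diopters_to_degrees(prism_diopters: float) -> float:
    """Convert a deviation in prism diopters to degrees.

    One prism diopter displaces a ray by 1 cm at a distance of 1 m,
    so the deviation angle is atan(prism_diopters / 100).
    """
    return math.degrees(math.atan(prism_diopters / 100.0))


def degrees_to_prism_diopters(angle_deg: float) -> float:
    """Inverse conversion: deviation in degrees to prism diopters."""
    return 100.0 * math.tan(math.radians(angle_deg))


# Threshold discussed in the review: 1 prism diopter expressed in degrees.
one_pd_deg = prism_diopters_to_degrees(1.0)
print(f"1 prism diopter = {one_pd_deg:.2f} degrees")  # prints 0.57

# The reported mechanical accuracy (0.1 degrees) expressed in prism diopters.
print(f"0.1 degrees = {degrees_to_prism_diopters(0.1):.3f} prism diopters")
```

Under this definition, the reported 0.1° mechanical accuracy corresponds to well under one prism diopter, which is the comparison the report relies on.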
- What specific improvements should the authors consider regarding the methodology?
The methodology is very detailed and quite precisely formulated. I have no particular complaints.
- Are the conclusions consistent with the evidence and arguments presented and do they address the main question posed? Please also explain why this is/is not the case.
The conclusions are short, clear and concise, with clearly stated and confirmed contributions. The conclusions are preceded by a well-conceived and detailed discussion in which strengths and weaknesses are highlighted and plans for further development are given.
- Are the references appropriate?
The paper relies on a satisfactory number of references that have been selected appropriately for the presented research.
- Any additional comments on the tables and figures.
- The text that explains Figure 1 (lines 87-90) should be part of the regular text written in the same font, which refers to Figure 1. The servo motor for horizontal rotation on Figure 1 is incorrectly labeled (it should be FS0307, not FS0007). I assume that the reference [20] is incorrectly given, and the correct one is [22] (Paysan P et al.).
- The text that explains Figure 2 should be part of the regular text written in the same font, which refers to Figure 2. Perhaps it should be stated (for the sake of understanding for the wider population) that the designation MPU6050 refers to the gyro sensor (MEMS), and GY-521 is the PCB I2C interface module on which the MPU6050 is located, and through which the Arduino communicates with the gyro sensor.
Comments for author File:
Comments.pdf
Author Response
We sincerely thank the reviewer for taking the time to review our manuscript and for providing valuable and constructive comments. We have carefully addressed all comments, and the revised sections are highlighted in yellow in the manuscript. Our responses are detailed below.
Comments#1: The text that explains Figure 1 (lines 87-90) should be part of the regular text written in the same font, which refers to Figure 1. The servo motor for horizontal rotation on Figure 1 is incorrectly labeled (it should be FS0307, not FS0007). I assume that the reference [20] is incorrectly given, and the correct one is [22] (Paysan P et al.).
Author’s response#1: Thank you for pointing out these issues. We have corrected the font of the Figure 1 caption (lines 109–112) from Times New Roman to Palatino Linotype, consistent with the main text. Additionally, we have corrected the typographical error in the figure (FS0007 → FS0307) and the incorrect reference number in the figure caption. Please note that reference numbers have been renumbered due to revisions made in response to other reviewer comments.
Comments#2: The text that explains Figure 2 should be part of the regular text written in the same font, which refers to Figure 2. Perhaps it should be stated (for the sake of understanding for the wider population) that the designation MPU6050 refers to the gyro sensor (MEMS), and GY-521 is the PCB I2C interface module on which the MPU6050 is located, and through which the Arduino communicates with the gyro sensor.
Author’s response#2: Thank you for this helpful suggestion. We have corrected the font of the Figure 2 caption to Palatino Linotype, consistent with the main text. Furthermore, we have revised the caption to provide a clearer explanation as suggested. The revised caption now includes the following clarification (page 5, Figure 2 caption): "The gyro sensor used in this system is the MPU6050, a Micro Electro Mechanical Systems (MEMS)-based inertial measurement unit, mounted on a GY-521 breakout board that provides an I2C interface for communication with the Arduino Nano microcontrollers." Please let us know if any further clarification is needed.
Author Response File:
Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript investigates the development of an evaluation platform for monocular gaze estimation technology in strabismus patients, aiming to address the unvalidated performance of existing gaze estimation models under strabismic conditions and the lack of appropriate evaluation tools. To resolve this issue, the authors propose an open-source horizontal and vertical strabismus simulator, implementing real-time high-precision angle measurement through dual independently controlled artificial eyeballs integrated with vertical servo motors and gyro sensors, with experimental validation of its assessment performance on three deep learning models. The research possesses clinical translation value, but before recommendation for publication, the following issues require modification:
① Although the authors provide detailed descriptions of the system angular calibration methodology and AI model evaluation data collection process in Sections 2.2 and 2.3 respectively, the overall experimental workflow lacks intuitive visualization. It is recommended to supplement flowcharts in the System Calibration and Data Collection for AI Model Evaluation sections to visually present the methodological framework and data processing procedures.
② The angular error evaluation metric lacks a precise mathematical definition. It is recommended to provide the exact formula for error calculation and coordinate system definition to ensure the rigor of assessment results.
③ While the authors state that the developed simulator supports both horizontal and vertical movements, the experimental section does not appear to leverage this capability to simulate and evaluate model performance across different strabismus types. It is recommended to evaluate model performance by strabismus type subgroup rather than reporting only aggregate error metrics.
Author Response
We sincerely thank the reviewer for taking the time to review our manuscript and for providing specific and constructive suggestions to improve our paper. We believe that addressing your comments has significantly enhanced the clarity and readability of the manuscript. We have carefully addressed all comments, and the revised sections are highlighted in yellow in the manuscript. We respond to your comments as follows.
Comments#1: Although the authors provide detailed descriptions of the system angular calibration methodology and AI model evaluation data collection process in Sections 2.2 and 2.3 respectively, the overall experimental workflow lacks intuitive visualization. It is recommended to supplement flowcharts in the System Calibration and Data Collection for AI Model Evaluation sections to visually present the methodological framework and data processing procedures.
Author’s response#1: Thank you for this suggestion regarding the lack of intuitive visualization. In response to your comment, we have added two flowcharts to illustrate the system validation procedure in Section 2.2 and the data collection process for AI model evaluation in Section 2.3:
Figure 4: Flowchart of the system validation protocol (Section 2.2). This figure illustrates the system calibration procedure, including the validation process of 100 trials × 3 sets.
Figure 5: Flowchart of the data collection protocol for AI model evaluation (Section 2.3). This figure shows both the data acquisition loop and the subsequent image processing pipeline.
Furthermore, we have revised the descriptions in Section 2.2 (lines 144–146) and Section 2.3 (lines 163–176) to ensure textual clarity and consistency with the new figures. We believe these additions allow readers to intuitively grasp the experimental workflow. We appreciate your valuable feedback.
Comments#2: The angular error evaluation metric lacks a precise mathematical definition. It is recommended to provide the exact formula for error calculation and coordinate system definition to ensure the rigor of assessment results.
Author’s response#2: We appreciate the reviewer's comment regarding the lack of mathematical definitions for the angular error evaluation metrics. In response to this suggestion, we have added the coordinate system definition and the formulas used for calculating angular errors.
The coordinate system is defined as follows: the gaze angle is represented as a two-dimensional vector whose first component is the horizontal angle (yaw) and whose second component is the vertical angle (pitch). The frontal gaze direction is defined as the origin (0°, 0°), with positive yaw indicating rightward rotation and positive pitch indicating upward rotation.
Based on this coordinate system, we have introduced two equations: Equation (1) represents the error between the commanded angle and the gyro sensor-measured angle for mechanical validation, and Equation (2) represents the error between the AI model-estimated angle and the gyro sensor-measured angle for model evaluation.
For detailed explanations, please refer to lines 150–160 for the coordinate system definition and Equation (1), and lines 196–202 for Equation (2).
We hope these additions ensure the rigor of the assessment methodology.
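The exact forms of Equations (1) and (2) are not reproduced in this response. As a rough illustration only, the sketch below assumes both are per-axis mean absolute errors (in degrees) between a reference angle sequence and a compared angle sequence, using the (yaw, pitch) convention described above. The function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

# (yaw_deg, pitch_deg): positive yaw = rightward, positive pitch = upward.
Angle = Tuple[float, float]


def mean_absolute_error(reference: Sequence[Angle],
                        compared: Sequence[Angle]) -> Angle:
    """Return the per-axis mean absolute error (yaw, pitch) in degrees.

    Assumed usage, not confirmed by the manuscript:
      - mechanical validation: reference = gyro-measured, compared = commanded
      - model evaluation:      reference = gyro-measured, compared = model output
    """
    if len(reference) != len(compared) or len(reference) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)
    yaw_mae = sum(abs(c[0] - r[0]) for r, c in zip(reference, compared)) / n
    pitch_mae = sum(abs(c[1] - r[1]) for r, c in zip(reference, compared)) / n
    return yaw_mae, pitch_mae


# Made-up sample values for illustration only.
gyro_measured = [(0.0, 0.0), (10.0, -5.0), (-15.0, 5.0)]
commanded = [(0.1, 0.0), (9.9, -5.2), (-15.1, 5.1)]
yaw_mae, pitch_mae = mean_absolute_error(gyro_measured, commanded)
print(f"yaw MAE = {yaw_mae:.2f} deg, pitch MAE = {pitch_mae:.2f} deg")
```

The same function would serve both comparisons by swapping the second argument, which is why the two equations are treated here as one helper.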
Comments#3: While the authors state that the developed simulator supports both horizontal and vertical movements, the experimental section does not appear to leverage this capability to simulate and evaluate model performance across different strabismus types. It is recommended to evaluate model performance by strabismus type subgroup rather than reporting only aggregate error metrics.
Author’s response#3: We appreciate the reviewer's insightful suggestion. The developed simulator is capable of independent control of both horizontal and vertical axes, enabling simulation of vertical strabismus as well (Figures 1 and 2). However, as the reviewer correctly pointed out, this study focused exclusively on horizontal strabismus for the following reasons. First, horizontal strabismus accounts for more than 90% of all strabismus cases [1, 2], representing the highest clinical significance. Second, the primary objective of this study was to demonstrate a proof of concept for a novel approach to evaluating gaze estimation AI models under strabismic conditions, and we determined that it was essential to first establish efficacy in horizontal strabismus, which has the greatest clinical relevance.
In light of this, we have added explicit statements clarifying that the evaluation was limited to horizontal strabismus (lines 12, 16, and 36–39 in the revised manuscript).
Additionally, the extension of evaluation to vertical strabismus has been explicitly stated as a future research direction in Section 4.2 (Future Directions) of the Discussion (lines 278–282).
Author Response File:
Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
Review comments on jemr-4062410
The paper presents an open-source, low-cost strabismus simulator that is capable of reproducing disconjugate eye movements to evaluate monocular gaze estimation AI models. The simulator includes two independently controllable artificial eyeballs that are mounted on a two-axis gimbal mechanism using servo motors and gyro sensors for real-time angle measurement.
The paper evaluates the simulator and reports its mechanical accuracy, which exceeds the accuracy of the three representative AI models: Single Eye, GazeNet, and EyeNet.
The main contributions of the paper include 1) development of the strabismus simulator, which demonstrates advantages in design cost, mechanical accuracy, and functionalities; 2) evaluation of the simulator in terms of accuracy and other metrics, exhibiting effectiveness of the simulator.
The topic of the paper is important, as strabismus has received little attention in existing eye gaze research. Presenting an effective strabismus simulation tool can help improve research in this area.
The paper presents rich details about the development, specifications, and evaluation results of the proposed simulator. However, several major weaknesses prevent the paper from being considered for publication. These are summarised below.
- The paper reads like a mechanical experiment report: it puts a strong focus on mechanical design and technical evaluation but does not involve sufficient scientific research, e.g., developing a scientific research question and hypothesis and subsequently validating them. Because of this, the paper does not establish its contributions to the field as a whole.
- The paper describes the importance of strabismus as a visual disorder that affects a large population and needs more attention. It also states that “current ocular alignment examinations require specialized equipment and trained orthoptists and are therefore primarily limited to ophthalmology clinics”, which indicates that the absence of examination equipment is a notable barrier. However, it is unclear how the strabismus simulator serves as an effective examination tool, especially when it does not involve real data from strabismus patients.
- “However, collecting large amounts of data from patients is ethically and practically difficult [12,13], and existing simulators (UnityEyes [14] and U2Eyes [15]) assume normal binocular vision and are therefore inapplicable to strabismus research.” –this statement points to the insufficiency of data from strabismus patients. However, it is unclear why a physical simulator is necessary, since digital simulation works well for the purpose of data collection.
- The paper does not provide a comprehensive review of the literature and therefore fails to reflect the full picture of strabismus research. Although it presents a number of references, it is unclear whether similar simulators have been devised by other researchers; how such digital simulation tools and systems work, if they exist; whether there are user studies in strabismus research and how they proceeded without sufficient data; and whether any eyeball systems have already been invented and in which real applications they are used. As a result, the paper does not have a strong and solid grounding for the development of the simulator.
- It is a merit that the paper includes many technical details of the simulator's development, which allow it to be reproduced. However, the description of these details fails to highlight technical soundness. This is not to say that Arduino and related sensors and actuators are unacceptable, but the rationales for these design choices are largely missing. Consequently, the paper reads like a mechanical course report, with many technical details but no explanation of why they were chosen.
- Many parts are incomplete. For instance, Section 2.2 (system calibration) is essential to the simulator evaluation, yet neither the calibration nor the data analysis is clearly elaborated. In Section 2.4 (evaluation of AI models for monocular gaze estimation), the experimental procedures are also ambiguous.
- The paper only evaluates existing AI models on collected datasets. No user studies were involved. This makes the evaluation results hard to interpret, as they lack sufficient validation from the perspective of clinical effectiveness, e.g., how does the simulator meet the high requirements for clinical examination of strabismus? Without such validation, the outcomes of the paper are quite limited.
- The reasons for choosing three AI models are not well justified.
- In 3. Results, the findings are reported in a clear manner. Despite technical clarity, their scientific implications are ambiguous.
- In 4. Discussion, the paper does not address my concerns raised previously. It fails to reflect effectiveness of the proposed simulator from a scientific perspective. Its practical implications are discussed in quite a limited way. The current discussion is more like a short repetition of study findings rather than extending new understandings.
Overall, the paper is technically appealing; however, as scientific research, it falls short in multiple respects, including but not limited to research question development, literature review, experiment design and rigorous data analysis, and scientific novelty.
Author Response
We sincerely thank the reviewer for taking the time to review our manuscript. We appreciate your understanding of the significance of our research theme while providing critical feedback from a scientific perspective regarding the shortcomings of our paper. Your comments have enabled us to substantially improve the theoretical aspects of our work. We have carefully addressed all comments, and the revised sections are highlighted in yellow in the manuscript. We have made the following revisions.
Comments#1: The paper reads like a mechanical experiment report: it puts a strong focus on mechanical design and technical evaluation but does not involve sufficient scientific research, e.g., developing a scientific research question and hypothesis and subsequently validating them. Because of this, the paper does not establish its contributions to the field as a whole.
Author’s response#1: We sincerely appreciate this important comment. We acknowledge that the original manuscript lacked explicit research questions and hypotheses. In the revised version, we have substantially restructured the Introduction (lines 36–101) to clearly present the scientific framework of this study.
Specifically, we have explicitly stated two research questions: (1) Can a physical strabismus simulator achieve sufficient mechanical accuracy for AI model evaluation? (2) How accurately can existing monocular gaze estimation models, trained on healthy subjects, estimate gaze direction under simulated strabismic conditions?
Furthermore, we have presented explicit hypotheses: first, that the physical simulator would achieve mechanical accuracy below 0.57° (1 prism diopter); and second, that existing AI models would exhibit degraded gaze estimation accuracy under strabismic conditions compared to their reported performance on normal subjects.
We believe this revision clarifies the scientific contributions of this study—namely, achieving sufficient mechanical accuracy of the simulator and establishing the first baseline performance metrics for gaze estimation under strabismic conditions using a physical simulation platform.
Comments#2: The paper describes the importance of strabismus as a visual disorder that affects a large population and needs more attention. It also states that “current ocular alignment examinations require specialized equipment and trained orthoptists and are therefore primarily limited to ophthalmology clinics”, which indicates that the absence of examination equipment is a notable barrier. However, it is unclear how the strabismus simulator serves as an effective examination tool, especially when it does not involve real data from strabismus patients.
Author’s response#2: Thank you for this comment. We apologize that our original manuscript did not clearly convey our intention. We would like to emphasize that this simulator is intended as an evaluation platform for assessing AI model performance under strabismic conditions, rather than as a clinical screening tool. We have carefully rewritten the Introduction to clarify this distinction.
In the revised Introduction, we have restructured the logical flow as follows: (lines 40–47) limitations of current ocular alignment examinations; (lines 48–58) limitations of existing automated methods including Hirschberg-based approaches and eye-tracking; (lines 59–66) potential of deep learning-based gaze estimation; (lines 67–75) technical gap—existing models have not been validated under strabismic conditions; and (lines 76–87) the need for an evaluation platform for deep learning-based gaze estimation models under strabismic conditions.
Furthermore, we have explicitly stated that establishing baseline performance metrics is necessary for developing strabismus-specific models, thereby clarifying the role of this simulator in the research pipeline.
Comments#3: “However, collecting large amounts of data from patients is ethically and practically difficult [12,13], and existing simulators (UnityEyes [14] and U2Eyes [15]) assume normal binocular vision and are therefore inapplicable to strabismus research.” –this statement points to the insufficiency of data from strabismus patients. However, it is unclear why a physical simulator is necessary, since digital simulation works well for the purpose of data collection.
Author’s response#3: We appreciate this important question. Modifying existing digital simulators such as UnityEyes and U2Eyes to generate disconjugate eye movements is considered technically challenging. These systems are fundamentally designed under the assumption of normal binocular vision, and enabling independent control of each eye would require substantial architectural modifications.
Our physical simulator provides a more straightforward and practical approach for reproducing strabismic eye movements. Furthermore, physical simulators offer the advantage of enabling evaluation under real-world lighting conditions that more closely approximate clinical deployment scenarios. Additionally, physical platforms offer inherent extensibility in appearance—in the future, artificial eyeballs can be easily replaced or modified to simulate various eye characteristics such as different iris colors, scleral features, and pathological appearances.
Comments#4: The paper does not provide a comprehensive review of the literature and therefore fails to reflect the full picture of strabismus research. Although it presents a number of references, it is unclear whether similar simulators have been devised by other researchers; how such digital simulation tools and systems work, if they exist; whether there are user studies in strabismus research and how they proceeded without sufficient data; and whether any eyeball systems have already been invented and in which real applications they are used. As a result, the paper does not have a strong and solid grounding for the development of the simulator.
Author’s response#4: Thank you for this valuable comment. In the revised manuscript, we have expanded the literature review regarding existing simulators.
As a physical simulator relevant to strabismus research, Lotze et al. (2022) developed EyeRobot, a robotic oculomotor simulator capable of emulating eccentric fixation and eye misalignment. This system was designed for validating the accuracy of infrared-based eye-tracking hardware (Pupil Core). The EyeRobot study adopted an evaluation approach using known ground-truth angles rather than comparative validation with actual patient data, with the primary objective of instrument validation using the simulator (lines 80–83).
Our study adopts a similar approach. Since collecting large amounts of data from strabismus patients is ethically and practically difficult [14, 15], using a simulator with known ground-truth angles enables systematic AI model evaluation while circumventing this constraint.
Existing digital simulators (UnityEyes [16] and U2Eyes [17]) are designed under the assumption of normal binocular vision, and generating disconjugate eye movements would require architectural modifications (lines 78–79).
Through this literature review, we have clarified the gap that this study addresses: EyeRobot was designed for infrared-based instrument validation, and to date, no physical simulator exists for evaluating appearance-based gaze estimation models using standard RGB cameras (lines 83–84).
Comments#5: It is a merit that the paper includes many technical details of the simulator's development, which allow it to be reproduced. However, the description of these details fails to highlight technical soundness. This is not to say that Arduino and related sensors and actuators are unacceptable, but the rationales for these design choices are largely missing. Consequently, the paper reads like a mechanical course report, with many technical details but no explanation of why they were chosen.
Author’s response#5: Thank you for acknowledging the reproducibility of our technical documentation. We agree that the rationale for component selection was not sufficiently explained in the original manuscript.
In the revised version, we have clarified that all components were selected based on open-source hardware principles [19–21]. The primary design objectives were to maximize accessibility and reproducibility for researchers worldwide, particularly those at institutions with limited budgets. The Arduino Nano microcontrollers were selected for their global availability, comprehensive documentation, and active community support. The MPU6050 gyro sensors and FS0307 servo motors were chosen because they provide sufficient accuracy for this application while maintaining low cost and ease of integration.
This design philosophy aligns with established open-source hardware initiatives in scientific research, such as the OpenFlexure Microscope [21] and other low-cost research instruments [20]. These have demonstrated that accessible and reproducible hardware can make meaningful scientific contributions while democratizing research capabilities globally.
Comments#6: Many parts are incomplete. For instance, Section 2.2 (system calibration) is essential to the simulator evaluation, yet neither the calibration nor the data analysis is clearly elaborated. In Section 2.4 (evaluation of AI models for monocular gaze estimation), the experimental procedures are also ambiguous.
Author’s response#6: Thank you for this comment. In the revised manuscript, we have made the following improvements to clarify the experimental procedures.
Section 2.2 (System Validation):
We have added a flowchart (Figure 4) to clarify the calibration procedure and data analysis methods. Specifically, the flowchart illustrates the protocol in which each set consisted of 100 trials of independent binocular rotation (±30° horizontal and vertical), with the final mean absolute error calculated as the mean ± SD across three sets. Additionally, we have explicitly stated the MAE calculation formula (Equation 1) and the coordinate system definition (lines 150–160) to quantitatively describe the data analysis methodology.
Section 2.3 (Data Collection for AI Model Evaluation):
We have added a flowchart of the data collection protocol (Figure 5) to visually clarify the experimental procedures. This figure illustrates the process in which each set consisted of 500 trials of independent binocular horizontal rotation (±15°) recorded every 1 second, yielding a total of 1500 eye images with synchronized ground-truth angles across three sets (lines 162–176).
Section 2.4 (Evaluation of AI Models for Monocular Gaze Estimation):
The evaluation metric for AI models (Equation 2) has been explicitly stated (lines 196–202).
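The aggregation described for Section 2.2 (a mean absolute error per set, then mean ± SD across three sets) can be sketched as follows. This is an illustration under assumptions: the response does not say whether the SD is the sample or population standard deviation, so the sample form (`statistics.stdev`) is assumed here, and the function name and error values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def summarize_sets(per_set_errors: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean, SD) of the per-set mean absolute errors.

    per_set_errors holds one sequence of signed angular errors (degrees)
    per set, e.g. 3 sets x 100 trials for the mechanical validation.
    Assumption: SD is the sample standard deviation across sets.
    """
    per_set_mae: List[float] = [
        statistics.fmean([abs(e) for e in errors]) for errors in per_set_errors
    ]
    return statistics.fmean(per_set_mae), statistics.stdev(per_set_mae)


# Made-up errors for three short sets (the real protocol uses 100 trials each).
example_sets = [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]
mean_mae, sd_mae = summarize_sets(example_sets)
print(f"MAE = {mean_mae:.2f} ± {sd_mae:.2f} deg")

# Image count stated for Section 2.3: 500 trials per set across three sets.
TRIALS_PER_SET, NUM_SETS = 500, 3
print(TRIALS_PER_SET * NUM_SETS)  # prints 1500
```

Computing the SD across set-level means, rather than across all pooled trials, matches the "mean ± SD across three sets" wording, but it yields a spread based on only three values.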
Comments#7: The paper only evaluates existing AI models on collected datasets. No user studies were involved. This makes the evaluation results hard to interpret, as they lack sufficient validation from the perspective of clinical effectiveness, e.g., how does the simulator meet the high requirements for clinical examination of strabismus? Without such validation, the outcomes of the paper are quite limited.
Author’s response#7: Thank you for your comment regarding clinical validation.
As stated in our responses to Comments #2 and #4, the purpose of this simulator is to provide an AI model evaluation platform rather than a clinical screening tool. This approach is based on a research strategy similar to that of EyeRobot by Lotze et al. (2022) [18].
The primary contribution of this study is establishing baseline performance metrics for existing AI models under strabismic conditions. We believe this foundational knowledge is essential for the development of strabismus-specific models and clinical applications.
We agree that validation with real patients is essential for future clinical application. This has been explicitly stated in the Limitations section of the Discussion (lines 286–297). Our research pipeline envisions a stepwise approach: proof of concept in this study, followed by validation with real patients in subsequent work.
Comments#8: The reasons for choosing three AI models are not well justified.
Author’s response#8: (lines 183-187)  Thank you for this comment. In the revised manuscript, we have clarified the rationale for model selection in the Methods section (Section 2.4, lines 178–184).
The three selected models—Single Eye (four-layer CNN), GazeNet (VGG16-based), and EyeNet (ResNet18-based)—share a common architecture consisting of feature extraction layers followed by fully connected layers to estimate gaze direction from monocular eye images. This configuration represents the most fundamental approach in monocular gaze estimation and was deemed appropriate for baseline evaluation.
Furthermore, these models differ in architectural complexity (four-layer CNN → VGG16 → ResNet18). This selection allows us to examine whether deeper architectures contribute to performance improvement under strabismic conditions. By establishing baseline performance metrics with representative models, we provide a foundation for future evaluation of more advanced architectures.
Comments#9: In 3. Results, the findings are reported in a clear manner. Despite technical clarity, their scientific implications are ambiguous.
Comments#10: In 4. Discussion, the paper does not address my concerns raised previously. It fails to reflect effectiveness of the proposed simulator from a scientific perspective. Its practical implications are discussed in quite a limited way. The current discussion is more like a short repetition of study findings rather than extending new understandings.
Author’s response#9, 10: We appreciate your comments regarding the scientific interpretation of the research findings. As these are related comments, we will address them together.
We acknowledge that the original Discussion section was overly focused on restating the results. In the revised manuscript, we have made the following changes.
Revision of the Introduction: As stated in our response to Comment #1, we restructured the Introduction to clearly state the research questions and hypotheses (lines 88–94). Specifically, we established two research questions: (1) Can a physical strabismus simulator achieve sufficient mechanical accuracy for AI model evaluation? (2) How accurately can existing monocular gaze estimation models, trained on healthy subjects, estimate gaze direction under simulated strabismic conditions?
Restructuring of the Discussion: In the opening paragraph of the Discussion (lines 229–238), we have explicitly stated: (1) the verification results for both hypotheses; (2) the finding that existing gaze estimation models cannot be directly applied to strabismus screening without modification; and (3) the scientific insight for future research that the development of strabismus-specific models is essential for clinical application.
Additionally, we have provided concrete discussion of the positioning and future prospects of this simulator, including comparison with EyeRobot as mentioned in our response to Comment #4 (lines 240–246), and the extensibility to vertical strabismus as mentioned in our response to Comment #1 (lines 276–278).
We believe these revisions have transformed the manuscript from a technical report into a scientific study with clear hypotheses, validation, and implications. We would appreciate your feedback on whether these changes adequately address your concerns.
Author Response File:
Author Response.docx
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I have no further questions.
Author Response
Thank you very much for reviewing our manuscript.
In this revision, we received 12 detailed comments from Reviewer 3 and have addressed all of them, resulting in substantial revisions to the Introduction, Methods, and Discussion sections.
In the Introduction, we restructured the research questions into three distinct questions with clarified interrelationships, provided detailed justification for choosing a physical simulator, and expanded the literature review to include comparisons with existing systems such as UnityEyes, U2Eyes, NVGaze, and EyeRobot. In the Methods section, we added details on calibration procedures, image preprocessing, and synchronization methods, as well as theoretical explanations for the rationale behind model selection and the √2 reduction assumption. In the Discussion, we restructured the content to clearly address each research question and added new sections on "Clinical Interpretation" and "Scientific Contributions."
In addition, the manuscript has undergone professional English language editing to improve readability and clarity throughout.
We would be most grateful if you could review the revised manuscript at your convenience.
Author Response File:
Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents the development and validation of an open-source, low-cost (approximately $200 USD) horizontal strabismus simulator designed to evaluate AI-based monocular gaze estimation models.
The paper is motivated by the fact that 2-4% of the global population has strabismus—predominantly horizontal—and that existing gaze estimation models assume normal binocular vision without validation for strabismus patients. The authors built a physical simulator capable of reproducing disconjugate eye movements. The system is developed with two independently controllable artificial eyeballs mounted on a two-axis gimbal mechanism with servo motors and gyro sensors for real-time angle measurement.
The simulator demonstrated high mechanical accuracy with 0.1° mean absolute error across all axes, well below the clinical detection threshold of 1 prism diopter (≈0.57°). The authors evaluated three representative AI models (Single Eye, GazeNet, and EyeNet) using this simulator and found that even the best-performing model (EyeNet) exhibited estimation errors of 6.44–6.66°, which substantially exceeds the clinical target of 2.8°. Additionally, all models showed rapid accuracy degradation beyond ±15° gaze angles. The findings reveal significant limitations in current monocular gaze estimation technology for strabismus applications and highlight the need for specialized model development. The complete open-source design aims to facilitate further research in automated strabismus screening.
The revised paper has demonstrated some improvements in clarity and research motivation. However, it remains weak in presenting a clear and comprehensive picture of the research. As mentioned in the previous review comments, the paper still reads like an experiment report rather than a scientific research paper. Some fundamental issues are not satisfactorily addressed in the revised paper.
The remaining weaknesses are summarised below.
- The research questions and hypotheses, mentioned in lines 87-91, remain insufficient. The literature review does not sufficiently support the development of the research questions, and the relationship between the two research questions is ambiguous. In addition, the paper does not explain why a physical system is required rather than a configurable digital system, e.g., in VR environments.
- The scientific contributions of the paper remain unclear. The current contributions are the open-source platform, baseline performance metrics, and foundation for future development. But the key research questions are unrelated to these contributions. And it does not make much sense to describe a contribution to future development, as the current paper does not systematically validate the reliability and validity of the proposed system.
- Regarding literature review, in the previous paper, there is insufficient coverage of existing simulators and strabismus research. The revised paper adds Lines 73-87 with references e.g., EyeRobot [18], UnityEyes [16], U2Eyes [17], and discusses their limitations. While better, it still lacks comprehensive coverage of other physical eye simulators in clinical/research settings, detailed comparison of digital vs. physical simulation advantages, and user studies in strabismus research methodologies.
Comprehensive review of eye-tracking simulators in broader HCI/vision science, and recent strabismus detection methods beyond Hirschberg and eye-tracking remain insufficient.
- It remains relatively ambiguous how the simulator serves as an examination tool, and it is unclear why a physical simulator is needed. In the revised paper, lines 56-66 clarify the simulator's role in evaluating AI models rather than direct patient examination, and lines 73-87 explain limitations of digital simulators and lack of patient data. The distinction between evaluation platform and clinical tool needs to be clearer.
- Calibration and AI evaluation procedures remain ambiguous. Section 2.2 (lines 141-160) provides additional system validation protocol with MAE formula, section 2.4 (lines 177-202) clarifies AI evaluation methodology. However, their procedures need to be further explained for the reproducibility with clear metrics.
- Lines 178-185 explain that the models represent the "most fundamental approach in monocular gaze estimation" and share a common architecture. However, “most fundamental approach” is not supported by the literature. The explanation also still lacks depth, e.g., why these specific three models? What about other architectures?
- There are a few repetitions in the Discussion without extending understandings. The revisions have incorporated comparison with existing platforms, future research directions, and comprehensive limitations. These are positive improvements. However, the discussions remain weak in terms of extending how effectively the research work has addressed the key research questions and how well does it benefit the field, theoretically and practically, or clinically?
- Regarding clinical validation, lines 272-278 acknowledge this as a limitation, with no user studies or clinical effectiveness validation. As a research paper that is deeply rooted in strabismus, absence of clinical validation is unacceptable.
- The revision could better contextualise e.g., what 6.44-6.66° error means for clinical practice beyond comparing to 2.8° target. It is recommended adding clinical interpretation (e.g., what strabismus angles would be missed).
- Lines 280-286 in the revised paper acknowledge artificial eyeball limitations. The domain shift issue (lines 247-252) is recognised but not deeply explored.
- $200 USD is clearly stated (lines 127, 242, 306) in the revised paper, but what is the baseline cost of such a system?
- Some other technical issues, e.g., how precise is timestamp synchronization between video frames and gyro data? In lines 267-270, the future direction should be prioritised given the domain shift issue, and in lines 186-189, the √2 reduction assumption needs better justification.
Author Response
We sincerely thank you for providing 12 detailed and constructive comments on our manuscript. Your insightful feedback has substantially improved the quality of this paper.
We have addressed all of your comments. The major improvements are summarized below:
Research Framework:
- Restructured research questions into three distinct questions with clarified interrelationships
- Provided detailed justification for choosing a physical simulator
- Substantially expanded the literature review (added comparisons with UnityEyes, U2Eyes, NVGaze, EyeRobot, and other existing systems)
Methodological Clarification:
- Added details on calibration procedures, image preprocessing, and synchronization methods
- Explained the rationale for model selection based on the literature
- Provided theoretical justification for the √2 reduction assumption
Enhanced Discussion:
- Restructured to clearly address each research question
- Added a new "Clinical Interpretation" section detailing clinical implications
- Added a new "Scientific Contributions" section
- Expanded discussion on the impact of domain shift
In addition, the manuscript has undergone professional English language editing to improve readability and clarity throughout.
Detailed responses to each comment are provided below. We hope you find our revisions satisfactory.
Responses
Comments #1: The research questions and hypotheses, mentioned in lines 87-91, remain insufficient. The literature review does not sufficiently support the development of the research questions, and the relationship between the two research questions is ambiguous. In addition, the paper does not explain why a physical system is required rather than a configurable digital system, e.g., in VR environments.
Author's Response #1:
Based on your feedback, we made the following revisions:
(Lines 111-130) We restructured the research questions into three and clarified their relationships. We explicitly stated that RQ1 (simulator accuracy) serves as a prerequisite for RQ2 (AI model performance evaluation) and RQ3 (relationship between angular range and accuracy).
(Lines 120-128) We elaborated on the rationale for choosing a physical simulator: existing digital simulators lack the capability to reproduce non-conjugate eye movements, and modifying them would require substantial changes. In contrast, a physical simulator is relatively easy to implement by simply controlling each eyeball independently with servo motors. Furthermore, capturing images with an actual camera provides more appropriate conditions for evaluating appearance-based deep learning models.
Comments #2: The scientific contributions of the paper remain unclear. The current contributions are the open-source platform, baseline performance metrics, and foundation for future development. But the key research questions are unrelated to these contributions. And it does not make much sense to describe a contribution to future development, as the current paper does not systematically validate the reliability and validity of the proposed system.
Author's Response #2:
(Lines 343-355) In Section 4.3 "Scientific Contributions," we clarified the scientific contributions in three points:
- Provision of the first systematic platform for evaluating gaze estimation models under strabismus conditions
- First quantification of performance limitations of existing models trained on healthy subject data under strabismus conditions
- Systematic analysis of the relationship between gaze angle range and estimation accuracy
These contributions directly correspond to the results of Research Questions 1-3.
Comments #3: Regarding literature review, in the previous paper, there is insufficient coverage of existing simulators and strabismus research. The revised paper adds Lines 73-87 with references e.g., EyeRobot [18], UnityEyes [16], U2Eyes [17], and discusses their limitations. While better, it still lacks comprehensive coverage of other physical eye simulators in clinical/research settings, detailed comparison of digital vs. physical simulation advantages, and user studies in strabismus research methodologies. Comprehensive review of eye-tracking simulators in broader HCI/vision science, and recent strabismus detection methods beyond Hirschberg and eye-tracking remain insufficient.
Author's Response #3:
(Lines 80-109) We added the following simulators and methods in the Introduction:
- Synthetic image generators: UnityEyes, U2Eyes, NVGaze
- Computational biomechanical models: Orbit, SEE++
- GAN-based approaches: StyleGAN2-ADA
- Physical simulators: EyeRobot, 6-DOF biomimetic systems, artificial muscle-driven systems
We described the limitations of each system and compared them with the three essential functions that our simulator fulfills (reproduction of strabismus conditions, RGB image generation, and provision of known ground-truth angles).
Comments #4: It remains relatively ambiguous how the simulator serves as an examination tool, and it is unclear why a physical simulator is needed. In the revised paper, lines 56-66 clarify the simulator's role in evaluating AI models rather than direct patient examination, and lines 73-87 explain limitations of digital simulators and lack of patient data. The distinction between evaluation platform and clinical tool needs to be clearer.
Author's Response #4: (Lines 62-69, 111-130) We clarified in the Introduction and research question description (Lines 111-130) that the role of this simulator is "an evaluation platform for monocular gaze estimation AI models, not a direct patient examination tool." (Lines 356-374) Additionally, in Section 4.4 "Future Direction," we explicitly stated that this study is a preliminary study to enable clinical validation.
Comments #5: Calibration and AI evaluation procedures remain ambiguous. Section 2.2 (lines 141-160) provides additional system validation protocol with MAE formula, section 2.4 (lines 177-202) clarifies AI evaluation methodology. However, their procedures need to be further explained for the reproducibility with clear metrics.
Author's Response #5:
We added the following details:
- (Lines 182-184) Madgwick filter: We specified that filter parameters are provided in the Arduino source code in the GitHub repository.
- (Lines 208-210) Image preprocessing: We added the procedure for eye region detection using MediaPipe Iris, cropping to 60×36 pixels, grayscale conversion, and histogram equalization.
- (Lines 211-216) Synchronization method: We added an explanation of timing control for rotation commands and angle measurements using Processing software.
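The preprocessing steps listed above can be sketched roughly as follows. This is a minimal NumPy-only illustration, not the authors' pipeline: the eye bounding box is assumed to be supplied by an external detector (MediaPipe Iris in the manuscript), the function names are hypothetical, nearest-neighbour resizing is an assumption about how the crop reaches 60×36 pixels, and 60×36 is read here as width × height.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 uint8 RGB image to uint8 grayscale (BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image (assumed resizing method)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]

def equalize_hist(gray):
    """Classic histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[np.nonzero(cdf)[0][0]]  # CDF at the darkest level present
    if gray.size == cdf_min:              # constant image: nothing to stretch
        return gray.copy()
    lut = (cdf - cdf_min) / (gray.size - cdf_min) * 255
    lut = np.clip(np.rint(lut), 0, 255).astype(np.uint8)
    return lut[gray]

def preprocess_eye(frame, box, out_w=60, out_h=36):
    """Crop the eye region, convert to grayscale, resize to 60x36, equalize.

    `box` is a hypothetical (x, y, width, height) tuple from an eye detector.
    """
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    gray = to_grayscale(crop)
    small = resize_nearest(gray, out_h, out_w)
    return equalize_hist(small)
```

The order of operations (crop, grayscale, resize, equalize) follows the list in the response; the parameters actually used are those in the authors' repository.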
Comments #6: Lines 178-185 explain that the models represent the "most fundamental approach in monocular gaze estimation" and share a common architecture. However, “most fundamental approach” is not supported by the literature. The explanation also still lacks depth, e.g., why these specific three models? What about other architectures?
Author's Response #6: (Lines 222-234) We elaborated on the rationale for model selection in Section 2.4:
- Reproducibility: Training datasets and model architectures are publicly available for all three models
- Graduated evaluation of complexity: The architectural complexity progressively differs from a simple 4-layer CNN (Single Eye) to VGG16-based (GazeNet) and ResNet18-based (EyeNet), enabling evaluation of whether model complexity contributes to performance improvement under strabismus conditions
- We also cited Cheng et al.'s comprehensive review [15] to demonstrate that CNN-based appearance feature extraction approaches are standard methods.
Comments #7: There are a few repetitions in the Discussion without extending understandings. The revisions have incorporated comparison with existing platforms, future research directions, and comprehensive limitations. These are positive improvements. However, the discussions remain weak in terms of extending how effectively the research work has addressed the key research questions and how well does it benefit the field, theoretically and practically, or clinically?
Author's Response #7:
(Lines 281-386) We restructured the Discussion to clearly correspond to responses to each research question:
- Response to RQ1: The simulator achieved MAE of less than 0.2° (below the clinical threshold of 1 PD ≈ 0.57°)
- Response to RQ2: All three models failed to achieve the clinical target of 2.8°, with errors of 6.44-8.75°
- Response to RQ3: Accuracy degradation with increasing angular range was confirmed
We also removed repetitions and added clinical interpretation (Section 4.1) and scientific contributions (Section 4.3) as independent subsections.
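For readers unfamiliar with the RQ1 criterion, the comparison of mean absolute error (MAE) against the 1 PD ≈ 0.57° threshold can be sketched as below. This is an illustrative sketch only: the function name is hypothetical and the angle values are made-up placeholders, not measurements from the study.

```python
def mean_absolute_error(commanded_deg, measured_deg):
    """MAE between commanded and measured angles, in degrees."""
    if not commanded_deg or len(commanded_deg) != len(measured_deg):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(c - m) for c, m in zip(commanded_deg, measured_deg))
    return total / len(commanded_deg)

# Clinical detection threshold quoted in the text: 1 prism diopter, about 0.57 degrees.
CLINICAL_THRESHOLD_DEG = 0.57

# Placeholder values for illustration only (not data from the manuscript).
commanded = [-20.0, -10.0, 0.0, 10.0, 20.0]
measured = [-19.9, -10.1, 0.1, 10.2, 19.9]

mae = mean_absolute_error(commanded, measured)
meets_threshold = mae < CLINICAL_THRESHOLD_DEG
```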
Comments #8: Regarding clinical validation, lines 272-278 acknowledge this as a limitation, with no user studies or clinical effectiveness validation. As a research paper that is deeply rooted in strabismus, absence of clinical validation is unacceptable.
Author's Response #8: (Lines 356-374, 375-386) We clarified the positioning of this study in the Future Direction and Limitation sections. This study is positioned as a "preliminary stage" for clinical validation. Strabismus screening requires monocular gaze estimation, and this is a proof-of-concept study to verify its accuracy in a simulated environment. By quantitatively demonstrating the limitations of existing models, this study has established the scientific rationale and ethical justification for clinical validation. Future research will verify the correlation between simulator evaluation results and actual patient measurements.
Comments #9: The revision could better contextualise e.g., what 6.44-6.66° error means for clinical practice beyond comparing to 2.8° target. It is recommended adding clinical interpretation (e.g., what strabismus angles would be missed).
Author's Response #9:
(Lines 299-320) We established a new Section 4.1 "Clinical Interpretation" and added the following clinical interpretations:
- Classification of strabismus severity: small-angle (<15 PD, approximately <8°), moderate (15-30 PD, approximately 8-17°), large-angle (≥30 PD, approximately ≥17°)
- With current estimation errors (6.44-6.66°), reliable detection of small-angle strabismus is difficult
- Moderate or larger angles (≥15 PD) fall within the detectable range even considering the errors
- Clinical significance of the high risk of missing small-angle strabismus, which is most important for amblyopia prevention
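The prism diopter (PD) and degree values in the list above are related by the standard definition of the prism diopter (a deviation of 1 cm at a distance of 1 m, so PD = 100·tan θ). A small sketch of that conversion and of the severity bands quoted above follows; the function names are hypothetical and this is not code from the manuscript.

```python
import math

def pd_to_degrees(prism_diopters):
    """Convert prism diopters to degrees using PD = 100 * tan(theta)."""
    return math.degrees(math.atan(prism_diopters / 100.0))

def degrees_to_pd(angle_deg):
    """Convert degrees to prism diopters (inverse of pd_to_degrees)."""
    return 100.0 * math.tan(math.radians(angle_deg))

def classify_severity(prism_diopters):
    """Severity bands as listed in the response: <15, 15-30, >=30 PD."""
    magnitude = abs(prism_diopters)
    if magnitude < 15:
        return "small-angle"
    if magnitude < 30:
        return "moderate"
    return "large-angle"

if __name__ == "__main__":
    for pd_value in (1, 15, 30):
        print(f"{pd_value} PD = {pd_to_degrees(pd_value):.2f} deg "
              f"({classify_severity(pd_value)})")
    for err_deg in (2.8, 6.44, 6.66):
        print(f"{err_deg} deg = {degrees_to_pd(err_deg):.1f} PD")
```

Under this definition, 15 PD and 30 PD correspond to roughly 8.5° and 16.7°, consistent with the rounded values of about 8° and 17° given above.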
Comments #10: Lines 280-286 in the revised paper acknowledge artificial eyeball limitations. The domain shift issue (lines 247-252) is recognised but not deeply explored.
Author's Response #10: We elaborated on the effects of domain shift in the Discussion and Limitation sections. Artificial eyeballs cannot fully reproduce iris patterns, dynamic pupil changes, scleral vascular patterns, and corneal optical properties. Part of the performance degradation observed in this study is attributable to domain shift and needs to be interpreted separately from the effects of strabismus conditions themselves. Future research should conduct separate analysis of these two factors using actual strabismus patient data.
Comments #11: $200 USD is clearly stated (lines 127, 242, 306) in the revised paper, but what is the baseline cost of such a system?
Author's Response #11: (Lines 97-100) We added a comparison with EyeRobot (estimated component cost of $200-500) in the Introduction. Our simulator costs approximately $200, which is comparable to EyeRobot's low cost while addressing a different purpose: evaluation of appearance-based deep learning models.
Comments #12: Some other technical issues, e.g., how precise is timestamp synchronization between video frames and gyro data? In lines 267-270, the future direction should be prioritised given the domain shift issue, and in lines 186-189, the √2 reduction assumption needs better justification.
Author's Response #12: (Lines 211-217) Regarding timestamp synchronization, Processing software provides unified control of timing for sending rotation commands to Arduino and angle measurements. The simulator receives rotation commands at 1-second intervals and acquires angle data after rotation completion, ensuring temporal correspondence between video frames and gyro measurements for each rotation cycle. (Lines 238-243) Regarding the √2 reduction assumption, based on measurement error theory, averaging multiple independent measurements reduces the standard error in proportion to the square root of the number of measurements [34]. In our method, estimates are made independently from the first and second eye images, making the two estimation trials independent, thus validating this assumption.
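The √2 argument can be written out as a short derivation. This is a sketch under the stated assumptions (two unbiased estimates with equal variance σ²); the symbols θ̂₁, θ̂₂, σ and ρ are introduced here for illustration and do not appear in the response.

```latex
% Two single-image estimates \hat{\theta}_1, \hat{\theta}_2 with equal variance \sigma^2.
\bar{\theta} = \tfrac{1}{2}\left(\hat{\theta}_1 + \hat{\theta}_2\right)

% If the two estimates are independent:
\operatorname{Var}(\bar{\theta})
  = \tfrac{1}{4}\left(\sigma^2 + \sigma^2\right)
  = \frac{\sigma^2}{2}
\quad\Longrightarrow\quad
\operatorname{SD}(\bar{\theta}) = \frac{\sigma}{\sqrt{2}}

% If the estimates are correlated with coefficient \rho:
\operatorname{Var}(\bar{\theta})
  = \tfrac{1}{4}\left(2\sigma^2 + 2\rho\,\sigma^2\right)
  = \frac{\sigma^2 (1 + \rho)}{2}
\quad\Longrightarrow\quad
\operatorname{SD}(\bar{\theta}) = \sigma \sqrt{\frac{1 + \rho}{2}}
```

The full √2 reduction therefore holds only when ρ = 0; any positive correlation between the two estimates, for example from a shared model bias, would make the reduction smaller.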
Author Response File:
Author Response.docx

