AI-Based Prediction of Visual Performance in Rhythmic Gymnasts Using Eye-Tracking Data and Decision Tree Models
Round 1
Reviewer 1 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
The revised version made some improvements but still suffers from major issues that need to be addressed. The authors should understand that this study's novelty is not in its machine learning methodology, but rather in its domain-specific application. Therefore, the authors need to focus on structuring the problem better, with a clear problem definition and thoughtful feature engineering. It is crucial to provide deep insights into the problem and dataset instead of simply fitting models and offering superficial explanations. The following are my detailed suggestions.
- The age of gymnasts ranged from 4 to 27 years old. The study's own data clearly shows that visual skills like reaction time and hand-eye coordination improve significantly with age, so the model may have simply learned to predict a gymnast's age. The authors must address and discuss this issue.
- The choice to simplify complex visual scores into simple "normal" or "reduced" categories is overly simplistic and cannot provide trustworthy insights. The authors should explore using more granular categories.
- The paper's description of the model testing protocol is confusing. It mentions both a 70/30 training-to-testing split and 5-fold cross-validation for tuning. Furthermore, the authors concede in the limitations section that using a "single hold-out data split" is a weakness. This ambiguity makes it difficult to understand how the final performance numbers were generated.
- Selecting predictive features based on individual ANOVA F-scores is a suboptimal approach that might have caused the models to miss important interactions between different visual skills.
- The authors need to report the actual breakdown of gymnasts in each category. For instance, how many participants were classified with "reduced" accommodative facility? The paper's admission that the models struggled to identify positive cases strongly suggests a class imbalance, which could make the reported accuracy scores misleading.
- It is unclear what the Decision Tree model actually learned. There are no feature importance plots or other analyses to show which visual tests were the most important for making predictions. The paper claims the model captures "subtle interactions" but does not substantiate this claim.
Author Response
Response to Reviewer 1
- Problem definition and feature engineering
Reviewer’s comment: "The authors should understand that this study's novelty is not in its machine learning methodology, but rather in its domain-specific application. Therefore, the authors need to focus on structuring the problem better, with a clear problem definition and thoughtful feature engineering."
Authors’ response: We agree with the reviewer that the main contribution lies in the domain-specific application to rhythmic gymnastics. Accordingly, we have revised the introduction and methods sections to better define the research problem and contextualize the importance of predicting specific visual skills. We have also added a new subsection detailing the rationale behind the selection of each visual variable and its relevance to gymnastics performance.
- Age as a confounding factor
Reviewer’s comment: "The model may have simply learned to predict a gymnast's age."
Authors’ response: Thank you for this valuable observation. To address this concern, we conducted an additional analysis to evaluate the relationship between chronological age and the visual variables used in our model. As shown in the newly added Figure 6, the correlation between age and Near Convergence Point (NCP) was weak (R² = 0.075), with a shallow slope (y = 11.22 + 0.31x). This indicates that age only partially explains the variability in visual performance.
Furthermore, we trained models excluding age as an input feature and observed that the predictive performance remained high. This suggests that the model’s ability to classify visual function is not primarily driven by age-related maturation, but rather by meaningful patterns in the visual data itself.
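For illustration, the age-ablation check can be sketched as follows (a minimal example; the file name "visual_tests.csv" and the column names "age" and "af_label" are hypothetical placeholders for our dataset):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("visual_tests.csv")                     # hypothetical file name
y = df["af_label"]                                        # binary visual-performance label
X_full = df.drop(columns=["af_label"])
X_no_age = X_full.drop(columns=["age"])                   # ablate chronological age

for name, X in [("with age", X_full), ("without age", X_no_age)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
    model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, model.predict(X_te)), 3))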
We have included this figure and clarification in the revised manuscript to improve transparency and address this concern directly.
- Binary categorization is too simplistic
Reviewer’s comment: "The choice to simplify complex visual scores into simple 'normal' or 'reduced' categories is too simplified."
Authors’ response: We appreciate the reviewer’s observation and understand the concern regarding the potential oversimplification of complex visual data. The decision to dichotomize visual performance into “normal” and “reduced” categories was grounded in methodological considerations. Given the heterogeneity of the sample and the exploratory nature of this study, a binary classification using the median as a non-parametric threshold enabled us to reduce the impact of outliers and achieve a more balanced class distribution for training robust machine learning models.
This approach also facilitated clinical interpretability, especially in a context involving diverse age groups and visual development stages. While we acknowledge that dichotomization may limit the ability to detect more nuanced differences, it provided a clear and replicable framework for initial model evaluation. We have clarified this rationale in the revised Discussion section.
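As an illustration of this labeling step, the median-based binarization amounts to the following minimal sketch (file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
threshold = df["accommodative_facility"].median()         # non-parametric cutoff
# Values at or above the sample median -> "normal" (1); below -> "reduced" (0)
df["af_label"] = (df["accommodative_facility"] >= threshold).astype(int)
print(df["af_label"].value_counts(normalize=True))        # roughly 50/50 by construction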
Furthermore, we agree that future work should expand on this by applying multi-class or continuous modeling approaches, allowing a more granular analysis of visual performance distributions. This direction has been briefly mentioned in the Limitations section as part of our roadmap for future research.
- Ambiguity in model evaluation (hold-out vs. CV)
Reviewer’s comment: "The paper's description of the model testing protocol is confusing. It mentions both a 70/30 split and 5-fold CV."
Authors’ response: We thank the reviewer for this observation and would like to clarify the model validation protocol. A 70/30 train-test split was applied, following standard practice in predictive modeling. This split was chosen to provide sufficient data for model training while reserving a substantial portion for performance evaluation on unseen data, thereby reducing overfitting and offering a realistic estimate of generalization capacity.
The choice of the 70/30 scheme was based on its balance between training efficiency and validation strength, particularly in the absence of an external dataset for independent validation. No k-fold cross-validation or repeated testing procedures were applied in this study; the split was used as a single internal evaluation framework.
We have revised the manuscript to clarify this point, and in future studies, we plan to incorporate additional validation methods, such as k-fold cross-validation, to further enhance the robustness of performance metrics and reduce partition-related variability.
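For clarity, the single 70/30 hold-out evaluation described above corresponds to the following minimal sketch (hypothetical file and column names):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)                # 70% train / 30% test

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average="macro"))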
- Feature selection using ANOVA F-score
Reviewer’s comment: "Selecting features based on individual ANOVA F-scores is suboptimal."
Authors’ response: We appreciate the reviewer’s comment. ANOVA F-score was selected for feature selection due to its simplicity, computational efficiency, and interpretability, which are especially advantageous in exploratory analyses with limited training data. This method allowed us to retain features with the highest individual discriminative power.
While we acknowledge that more advanced methods could capture feature interactions, our priority in this study was to maintain methodological transparency. In future work, we plan to incorporate multivariate or embedded feature selection techniques to assess their potential impact on model performance.
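A minimal sketch of this univariate ANOVA F-test selection step (k and the file/column names are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

selector = SelectKBest(score_func=f_classif, k=8)         # k chosen for illustration
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)                   # apply training-set scores to test data
print(dict(zip(X_train.columns, selector.scores_.round(2))))   # per-feature F-scores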
- Class imbalance and lack of category breakdown
Reviewer’s comment: "How many participants were classified with 'reduced' accommodative facility?"
Authors’ response: Participants were classified as having “normal” or “reduced” accommodative facility using the median value of the full dataset as a non-parametric cutoff. This approach resulted in an approximately balanced distribution (around 50% per group) and was selected due to the lack of clinically validated thresholds for this specific population.
Following the random split into training (70%) and testing (30%) subsets, slight variations in class proportions may have occurred due to random sampling. However, no significant adverse effects on model performance were observed.
As part of future improvements, we plan to implement stratified sampling to preserve class balance across subsets, as well as explore specific imbalance-handling strategies such as oversampling techniques or cost-sensitive algorithms if needed.
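The planned stratified sampling amounts to a one-argument change in the split (minimal sketch, same hypothetical names as above):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)    # preserve class proportions
print(y_train.mean(), y_test.mean())                      # class ratios now match across subsets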
- Lack of feature importance interpretation
Reviewer’s comment: "It is unclear what the Decision Tree model actually learned."
Authors’ response: Thank you for raising this point. The final Decision Tree model was trained using the most discriminative features identified through univariate selection, and its structure reflects a series of rule-based decisions that allow accurate classification of visual performance categories. While the full tree is available, its high depth and complexity made graphical representation impractical due to limited legibility, even at high resolution. For this reason, we chose not to include it as a figure in the manuscript.
Nonetheless, the strong classification performance obtained—particularly in fixation stability—supports the model’s ability to learn meaningful and generalizable visual patterns in rhythmic gymnasts. We are happy to provide the tree structure upon request or as supplementary material if deemed useful.
The most important variables in the final tree included [e.g., fixation stability, accommodative facility], consistent with their clinical relevance. We have added this to the Discussion section.
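As an illustration of how such importances can be extracted, a minimal sketch follows (hypothetical file and column names; this does not reproduce the fitted tree reported in the manuscript):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))   # most influential predictors
print(export_text(tree, feature_names=list(X.columns), max_depth=3))  # compact rule view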
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
Title: AI-Based Prediction of Visual Performance in Rhythmic Gymnasts Using Eye-Tracking Data and Decision Tree Models
This manuscript presents a robust and innovative application of supervised machine learning (ML) methods—Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN)—to predict visual performance metrics such as fixation stability and accommodative facility in rhythmic gymnasts using data collected from eye-tracking and optometric testing (DIVE system). The research fills a gap in sports vision science by providing AI-based models for athlete evaluation, and it presents a well-executed experimental design, performance evaluation, and interpretation.
Concerns and Recommendations:
1- While DT performed best for fixation stability tasks with near-perfect classification (accuracy ≈ 97%), its poor generalization in accommodative facility prediction (accuracy ≈ 66%) raises concerns about model selection bias. Authors should explain why DT was chosen as the primary model despite its underperformance on accommodative facility. Consider discussing model complexity, overfitting tendencies, and suitability for small feature sets.
2- The performance of models like SVM and KNN—especially their difficulty classifying the positive class—is observed but not analytically discussed in terms of bias-variance trade-off or model sensitivity to hyperparameters. Include a paragraph comparing how each algorithm handles feature interactions and outliers in this specific dataset. For example, SVM may have suffered due to linear separability assumptions.
3- Although standardization is mentioned for SVM/KNN, the paper does not provide sufficient details on feature distribution, multicollinearity, or dimensionality reduction techniques (e.g., PCA). Consider elaborating whether dimensionality reduction was tested or why it was avoided. Discussing ANOVA-based univariate feature selection is good, but a multivariate or embedded method may be more powerful.
4- The relatively poor performance of all models on accommodative facility could be due to noise in clinical measurements or imbalance. However, this could be addressed more quantitatively—e.g., by reporting class distribution ratios, ROC curves, or precision-recall trade-offs.
5- The discussion could be significantly improved by incorporating and contrasting with prior AI-driven optical modeling work, especially where artificial neural networks have been successfully used to predict optical phenomena. Suggested references to include: 1- S. Hamedi, H.D. Jahromi, Performance analysis of all-optical logical gate using artificial neural network, Expert Systems with Applications 178 (2021): 115029. 2- S. Hamedi et al., Artificial intelligence-aided nanoplasmonic biosensor modeling, Eng. App. of AI 118 (2023): 105646. 3- H.D. Jahromi, S. Hamedi, AI approach for calculating electronic and optical properties of nanocomposites, Materials Research Bulletin 141 (2021): 111371. Add these to your Introduction and/or Discussion to (a) broaden the AI application landscape in optical systems, and (b) justify the future use of ANN models in sports vision modeling when dataset size permits.
6- Minor Comments
- Figure Captions: Should be made more self-contained (e.g., "Confusion matrix for DT model predicting REAF classification").
- Typographic consistency: Capitalization of model names (e.g., "Support Vector Machine" vs. "support vector machine") should be standardized.
- Equation inclusion: Include the formal definition of macro F1-score and clarify how it was computed for multi-class tasks (if applicable).
7- Broader Impact and Future Work
The authors correctly acknowledge that ANN models were excluded due to small sample size. However, they should consider briefly simulating the ANN model (even if overfitted) or provide a reference benchmark from similar data volumes to support this claim. In the “Future Work” section, suggest scenarios where CNNs, RNNs, or hybrid models (e.g., Neuro-fuzzy) could be applied, especially with larger multimodal datasets.
While the use of eye-tracking qualifies as biomedical optics, the manuscript does not sufficiently emphasize photonic principles, optical system design, or light–matter interaction, which are central to the journal’s thematic core. The work may be more appropriately positioned in journals such as Sensors, Biomedical Optics Express, or Journal of Biomedical Informatics, where the focus on applied AI in biosignal analysis is more central.
Author Response
Response to Reviewer 2
We thank Reviewer 2 for the thorough and constructive evaluation of our manuscript. We appreciate the recognition of the novelty and rigor of our work and have carefully addressed all suggestions. Below we provide a point-by-point response:
- DT model selection for accommodative facility despite poor performance
Comment: The Decision Tree (DT) model underperforms in accommodative facility prediction. Please justify its selection.
Response: Thank you for this important observation. We agree that DT did not achieve optimal results in predicting accommodative facility, and we now explicitly acknowledge this in the revised Discussion section. DT was highlighted primarily due to its superior performance in predicting fixation stability and its interpretability, which is valuable in a clinical-sports context. We have added discussion on model complexity, the potential for overfitting with small feature sets, and DT’s limitations in generalization compared to more regularized models like SVM.
- Algorithmic differences and bias-variance considerations
Comment: Include analytical comparison of model sensitivity and bias-variance trade-offs.
Response: We have added a paragraph comparing the strengths and weaknesses of DT, SVM, and KNN with respect to feature interaction handling, outlier sensitivity, and bias-variance characteristics. We also discuss the impact of linear separability assumptions in SVM and the effect of hyperparameters on KNN's sensitivity.
- Feature distribution, multicollinearity, and dimensionality reduction
Comment: Expand on feature characteristics and explain the absence of dimensionality reduction.
Response: We now clarify that we tested PCA and found minimal performance gains. Given the interpretability goal of this work and the modest dimensionality (14 variables), we opted to retain original features. We also verified low multicollinearity via variance inflation factors (VIF < 2 for all variables).
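A minimal sketch of the VIF check (hypothetical file and column names; assumes all predictors are numeric):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X = df.drop(columns=["af_label"])
X_const = sm.add_constant(X)                               # add intercept for a proper VIF

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns)
print(vif.round(2))                                        # values below 2 indicate low collinearity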
- Class imbalance and ROC/PR curves
Comment: Report class distributions and use ROC/PR curves for clarity.
Response: We thank the reviewer for this important suggestion. We have now explicitly reported the class distributions in the Methods section (binary classification using median split), ensuring transparency in data labeling.
To enhance clarity on model performance, we have added new visualizations. Figure 7 presents the Receiver Operating Characteristic (ROC) curve for the Decision Tree model trained on accommodative facility, showing excellent discriminative ability (AUC = 0.97). Additionally, Figure 8 shows the structure of the final Decision Tree model trained on the z-score dataset, highlighting the most relevant visual features for classification.
These additions aim to provide a clearer understanding of model performance and feature relevance, as suggested.
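For completeness, the ROC/AUC computation underlying the new figure follows this pattern (minimal sketch; file, column names, and max_depth are illustrative, not the manuscript's exact configuration):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
y_score = model.predict_proba(X_te)[:, 1]                 # probability of the positive class
fpr, tpr, _ = roc_curve(y_te, y_score)
print("AUC =", round(roc_auc_score(y_te, y_score), 3))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")                  # chance line
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.show()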
- Related AI work in optical modeling (Hamedi et al.)
Comment: Include references to optical modeling with AI and ANN for broader context.
Response: We appreciate these valuable references and have included them in both the Introduction and Discussion to highlight the broader application of AI in optical systems. These studies support our view that ANN models could provide value in future work when sample size permits.
- Minor comments
Figure captions: Captions have been revised to be self-contained and informative.
Model name consistency: All instances have been revised to consistently use capitalized model names (e.g., “Support Vector Machine”).
Equation: We now include the formula for macro F1-score and clarify its use for binary and ordinal classification tasks.
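For reference, the macro F1-score is the unweighted mean of the per-class F1 values (C = 2 in our binary tasks), with P_c and R_c the precision and recall of class c:

\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{F1}_c, \qquad \mathrm{F1}_c = \frac{2\,P_c\,R_c}{P_c + R_c}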
- Broader impact, ANN simulation, and photonics focus
Comment: Consider simulating ANN or referencing similar-sized studies; clarify photonics relevance.
Response: We have expanded the “Future Work” section to suggest scenarios for applying CNNs, RNNs, and hybrid models in sports vision, particularly with multimodal or longitudinal datasets. We explain that ANN simulation was omitted due to overfitting risks but cite benchmark studies using similar sample sizes to support our rationale. Regarding photonics relevance, we emphasize that the DIVE system relies on advanced light-based eye-tracking and optical stimulus presentation, aligning with biomedical optics domains. These points are now better articulated in the revised Introduction and Conclusion.
Round 2
Reviewer 1 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
Most of the concerns have been addressed. Since this paper is primarily focused on application rather than methodology, I believe it is suitable for publication.
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
The authors have performed an exemplary revision. They have not only addressed the likely criticisms but have often gone above and beyond what was required (e.g., running two new models). The manuscript has been transformed from what was likely a decent preliminary study into a robust, methodologically sound, and well-contextualized piece of research.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this study, the authors evaluated the predictive performance of three supervised machine learning algorithms—Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN)—in forecasting key visual skills relevant to rhythmic gymnastics. The Decision Tree model demonstrated the highest performance, achieving an average accuracy of 92.79% and a macro F1-score of 0.9276. In comparison, the SVM and KNN models showed lower accuracies (71.17% and 78.38%, respectively) and greater difficulty in correctly classifying positive cases. Notably, the DT model outperformed the others in predicting fixation stability and accommodative facility, particularly in short-duration fixation tasks. The results suggest that the Decision Tree algorithm is the most robust and accurate model for predicting visual skills in rhythmic gymnasts. These findings support the integration of machine learning in sports vision screening and suggest that predictive modeling can inform individualized training and performance optimization in visually demanding sports such as rhythmic gymnastics. Generally, the manuscript is well-structured and requires only minor revisions to reach publication standards.
- The abstract states that visual tests were conducted on 383 gymnasts, while the introduction reports 390 participants. Please verify and standardize the sample size throughout the manuscript.
- The “Age (years)” row in Table 2 conflicts with group labels. For example: AGE GROUP 2 is labeled "8-9 years," but the mean age is 7.53 (0.29), falling below the stated range. Revise labels or data accordingly.
- In Section 3.1, the conclusion is drawn based on Figure 3 that NCP values tend to decrease with age. However, in Table 2, the NCP value for the 19-27 age group is reported as 3.35 (0.60), which is higher than some younger age groups, contradicting the conclusion of a "consistent decline." This discrepancy requires discussion to explain the anomaly.
- The in-text citations in Section 6 currently use incorrect bracket styles. Please ensure all citations consistently follow the journal's required format.
Author Response
We sincerely thank Reviewer 1 for the careful reading of our manuscript and the constructive comments provided. We appreciate your positive evaluation of the structure and content, as well as the helpful suggestions that have contributed to improving the clarity and consistency of the paper. Below, we provide point-by-point responses to each of your observations.
- The abstract states that visual tests were conducted on 383 gymnasts, while the introduction reports 390 participants. Please verify and standardize the sample size throughout the manuscript.
Thank you for your observation. You are correct—there was a discrepancy in the reported sample size. The correct number of participants who completed the full set of visual tests and were included in the final analysis is 383 gymnasts. We have now corrected and standardized this number throughout the manuscript, including the abstract and introduction.
- The “Age (years)” row in Table 2 conflicts with group labels. For example: AGE GROUP 2 is labeled "8-9 years," but the mean age is 7.53 (0.29), falling below the stated range. Revise labels or data accordingly.
Thank you very much for this observation. You are correct—the age ranges originally assigned to the groups were inconsistent with the actual mean values. Upon review, we found overlapping age boundaries between Group 2 and Group 3, both of which included participants aged 8 years.
To correct this and ensure internal consistency, we have redefined the age groups based on the observed mean and standard deviation of each group. The revised labels are now as follows:
Age Group 1: 6–6.9 years
Age Group 2: 7–7.9 years
Age Group 3: 8–8.9 years
Age Group 4: 9–10.9 years
Age Group 5: 11–11.9 years
Age Group 6: 12–12.9 years
Age Group 7: 13–14.9 years
Age Group 8: 15–18 years
Age Group 9: 19–27 years
We have updated the table and corresponding references in the text accordingly.
- In Section 3.1, the conclusion is drawn based on Figure 3 that NCP values tend to decrease with age. However, in Table 2, the NCP value for the 19-27 age group is reported as 3.35 (0.60), which is higher than some younger age groups, contradicting the conclusion of a "consistent decline." This discrepancy requires discussion to explain the anomaly.
We appreciate your careful observation. You are correct in pointing out that while a general trend of decreasing NCP values with age is observed, the value for the 19–27 age group is higher than some younger groups. This anomaly may reflect individual variability or a lack of homogeneity in visual development at older ages, possibly influenced by prior training experience or decreased engagement in visually demanding tasks.
We have now added a sentence in the discussion to clarify this point and to temper the claim of a “consistent” decline across all age groups.
- The in-text citations in Section 6 currently use incorrect bracket styles. Please ensure all citations consistently follow the journal's required format.
Thank you for this correction. We have reviewed Section 6 and updated all in-text citations to match the journal's required citation style, using square brackets [] consistently throughout the text.
Reviewer 2 Report
Comments and Suggestions for Authors
This work offers a novel and practically relevant application of machine learning to predict visual performance in rhythmic gymnasts using eye-tracking data and supervised models. It is well-written and supported by a clear structure. I recommend acceptance after minor revisions. The issues raised are not critical to the scientific soundness of the work but would improve its clarity and reproducibility.
- Clarify how class labels were defined (especially for REAF/LEAF) and whether class imbalance was addressed.
- More information on model hyperparameters, feature selection process, and data preprocessing would improve reproducibility. For instance, were any normalization methods applied?
- Include confusion matrices or other diagnostic metrics to support model performance claims.
- While DT performs well for fixation tasks, its performance drops significantly in predicting accommodative facility. The manuscript should explore possible reasons—e.g., class imbalance, test variability, or overfitting.
- The potential integration of this model into training, injury prevention, or talent identification programs could be explained further to demonstrate applied value.
Author Response
We thank Reviewer 2 for the positive and encouraging feedback. We are pleased that the clarity, structure, and relevance of our work were appreciated. We also value your suggestions, which have helped us refine key methodological aspects of the study. Below we provide a point-by-point response to your comments.
- Clarify how class labels were defined (especially for REAF/LEAF) and whether class imbalance was addressed.
Thank you for raising this important point. The class labels for REAF (Right Eye Accommodative Facility) and LEAF (Left Eye Accommodative Facility) were defined based on clinical reference values adapted to the age range of the sample. Specifically, participants were categorized as "normal" or "reduced" accommodative facility according to whether their values fell within or below the expected range for their age group.
Regarding class imbalance, we acknowledge that some outcome variables presented unequal class distributions. To mitigate this, we used stratified cross-validation in model training and testing, ensuring that each fold preserved the original class proportions. We have now clarified this information in the Methods section.
- More information on model hyperparameters, feature selection process, and data preprocessing would improve reproducibility. For instance, were any normalization methods applied?
Thank you for this important observation. We have expanded the Materials and Methods section (now subsection 2.2: "Model Training and Preprocessing") to include more detailed information about the model hyperparameters, feature selection, and data preprocessing. Specifically, we now clarify that:
- No normalization or scaling was applied since the selected models (Decision Tree, SVM with linear kernel, and KNN) were robust to the original variable scales or adapted accordingly.
- A univariate feature selection method (ANOVA F-test) was applied to reduce dimensionality.
- For each algorithm, hyperparameter tuning was performed using grid search within the cross-validation procedure.
These additions aim to improve the reproducibility of the study.
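As an illustration of the tuning procedure mentioned in the last point, a minimal sketch of a grid search embedded in 5-fold cross-validation on the training split (hypothetical file/column names and an illustrative SVM grid):

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}   # illustrative grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)                               # tuning uses the training split only
print(search.best_params_, round(search.best_score_, 3))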
- Include confusion matrices or other diagnostic metrics to support model performance claims.
We sincerely thank the reviewer for this constructive suggestion. In response, we have substantially revised the Results section (Section 3.2) to include confusion matrices for each of the five machine learning models evaluated.
- While DT performs well for fixation tasks, its performance drops significantly in predicting accommodative facility. The manuscript should explore possible reasons—e.g., class imbalance, test variability, or overfitting.
Thank you for this insightful observation. We agree that the lower performance of the Decision Tree in predicting accommodative facility deserves further discussion. We have now included a paragraph in the Discussion section addressing potential causes, including class imbalance, inherent variability in accommodative facility testing, and the susceptibility of decision trees to overfitting when exposed to noisy data.
- The potential integration of this model into training, injury prevention, or talent identification programs could be explained further to demonstrate applied value.
Thank you for the valuable suggestion. We have now expanded the Discussion section to include practical implications of the model. Specifically, we discuss how AI-based visual prediction tools can support individualized training, early talent identification, and potentially aid in injury prevention by identifying visual deficits relevant to performance safety and motor coordination.
Reviewer 3 Report
Comments and Suggestions for Authors
- The claim that the Decision Tree (DT) is the most “robust and accurate” model is not consistently supported by the results—especially in tasks like accommodative facility prediction, where the DT underperforms.
- Relying on a single hold-out split for a dataset of only 383 samples undermines result stability and generalizability.
- The absence of documented tuning procedures and the likely omission of scaling (especially important for SVM and KNN) introduce bias and affect fairness. Please apply and report systematic hyperparameter tuning and appropriate feature scaling for all models.
- For a study published in 2025, it is unacceptable that no ensemble or modern benchmark models—such as Random Forest, XGBoost, or neural networks—are included.
- Model underperformance (e.g., for KNN) is presented without technical exploration or suggestions for remediation (e.g., class imbalance or scaling).
One example: the title refers to "Decision Tree Models" (plural), but throughout the paper, the authors focus only on "the Decision Tree model" (singular).
Author Response
We sincerely thank Reviewer 3 for their thorough and insightful evaluation of our manuscript. Your detailed comments have significantly contributed to improving the clarity, methodological rigor, and scientific value of our study. Below, we address each point raised, along with the corresponding revisions implemented in the manuscript.
- The claim that the Decision Tree (DT) is the most “robust and accurate” model is not consistently supported by the results—especially in tasks like accommodative facility prediction, where the DT underperforms.
We thank the reviewer for this valuable comment and fully agree that our initial statement regarding the Decision Tree (DT) being the “most robust and accurate” model was too general and did not reflect the nuanced performance observed across different tasks. Specifically, as correctly noted, the DT model showed inferior performance in predicting accommodative facility, particularly in terms of accuracy and macro F1-score, when compared to Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) in some cases. In response, we have revised the manuscript in several places to ensure a more precise and evidence-based description of model performance. Rather than referring to DT as the overall “most robust and accurate” model, we now state that:
The Decision Tree algorithm achieved the highest performance in predicting short-term fixation stability, but its effectiveness was limited in tasks involving accommodative facility, where other models such as SVM and KNN outperformed it in specific metrics.
- Relying on a single hold-out split for a dataset of only 383 samples undermines result stability and generalizability.
We thank the reviewer for this important observation. We acknowledge that the use of a single hold-out split introduces limitations in terms of model stability and generalizability, especially with a dataset of 383 samples. Our intention in using the hold-out method was to establish a baseline comparison across models under consistent data partitions. However, we agree that cross-validation would provide a more robust performance estimate, and we have noted this as a limitation in the revised manuscript (Section 6). In future work, we plan to incorporate k-fold cross-validation or repeated hold-out methods to ensure greater reliability and to reduce variance introduced by random sampling:
Another important limitation concerns the use of a single hold-out data split to train and evaluate the machine learning models. While this approach provided a consistent framework for baseline model comparison, it may lead to variability in performance metrics depending on how the data are partitioned. Given the modest sample size (n = 383), this strategy may reduce the generalizability and stability of the results. Future studies should consider applying k-fold cross-validation or repeated hold-out methods to obtain more robust and reliable performance estimates across different data partitions.
- The absence of documented tuning procedures and the likely omission of scaling (especially important for SVM and KNN) introduce bias and affect fairness. Please apply and report systematic hyperparameter tuning and appropriate feature scaling for all models.
We thank the reviewer for this important methodological observation. In our initial submission, we did not include a detailed description of the hyperparameter tuning process. However, we clarify that all reported results are based on empirical experimentation involving multiple runs and adjustments to key parameters for each model. For instance, for KNN we tested several values of k (including 3, 5, 7), and for SVM we explored different kernels (linear and RBF) and values for the regularization parameter C. We retained the best-performing configurations in terms of macro F1-score and accuracy, as reflected in Section 3.2.
Regarding feature scaling, we confirm that standardization was applied prior to training the SVM and KNN models, using z-score normalization. Tree-based models (DT, RF, XGBoost) were trained without scaling, as their internal structure is unaffected by feature magnitude.
While we did not conduct an exhaustive grid search or cross-validated tuning, we believe that our exploratory tuning procedure provides a reasonable approximation of model performance and ensures a fairer comparison. This has now been clarified in the Materials and Methods:
All machine learning models were implemented using the scikit-learn and XGBoost libraries in Python. Although a formal grid search procedure was not employed, we conducted empirical hyperparameter exploration for each algorithm to identify configurations that maximized predictive performance. For KNN, several values of k (e.g., 3, 5, 7) were tested; for SVM, we experimented with different kernels (linear, RBF) and regularization parameters (C). The best-performing configuration for each model was selected based on macro F1-score on the hold-out test set. Tree-based models (Decision Tree, Random Forest, XGBoost) did not require feature scaling.
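For transparency, the scaling and the small k sweep for KNN can be sketched as follows (hypothetical file and column names):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

for k in (3, 5, 7):                                        # values explored empirically
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_tr, y_tr)
    print(k, round(f1_score(y_te, knn.predict(X_te), average="macro"), 3))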
- For a study published in 2025, it is unacceptable that no ensemble or modern benchmark models—such as Random Forest, XGBoost, or neural networks—are included.
We appreciate the reviewer’s critical observation regarding the importance of including state-of-the-art benchmark models. In the revised version of the manuscript, we have addressed this concern by incorporating two widely used ensemble methods: Random Forest (RF) and Extreme Gradient Boosting (XGBoost). These models have been trained and evaluated using the same dataset and methodology as our previous classifiers. Their performance has been added to Section 3.2 (Machine Learning Models Performance).
Additionally, although artificial neural networks represent a powerful class of predictive models, they were not employed in this study due to concerns about overfitting and poor generalization associated with small sample sizes. The relatively modest number of observations and structured nature of the input features favored the use of tree-based models, which are more robust under such constraints. Future work should consider the integration of neural models when sufficient data are available to ensure reliable training and validation.
Although artificial neural networks (ANNs) are increasingly used in predictive modeling across biomedical and sports science applications, we deliberately excluded them from this study due to the limited sample size (n = 383) and the risk of overfitting. In small datasets with relatively low-dimensional and structured input features, ensemble methods such as Random Forest and XGBoost typically offer better generalization and interpretability. Future research may explore neural network architectures when larger and more diverse datasets become available.
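The two added ensemble baselines follow the same evaluation pattern (minimal sketch; hypothetical file/column names and illustrative hyperparameters, not the manuscript's exact settings):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

models = [("Random Forest", RandomForestClassifier(n_estimators=200, random_state=42)),
          ("XGBoost", XGBClassifier(n_estimators=200, max_depth=3, random_state=42))]
for name, model in models:
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print(name, accuracy_score(y_te, y_pred), f1_score(y_te, y_pred, average="macro"))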
- Model underperformance (e.g., for KNN) is presented without technical exploration or suggestions for remediation (e.g., class imbalance or scaling).
Thank you for your comment regarding model underperformance and the need for further technical exploration. We would like to clarify that the dataset was balanced for all classification tasks. Specifically, each visual variable was binarized using the sample median as a threshold, which ensured a roughly equal number of positive and negative cases per class. Therefore, class imbalance was not a contributing factor to the lower performance of the KNN model. We have included this clarification in the revised manuscript. We also acknowledge that KNN is more sensitive to the feature space and that future work may benefit from feature selection or dimensionality reduction techniques such as PCA to enhance its performance.
Additionally, we clarify that the lower performance observed in the KNN model is not attributable to class imbalance. All outcome variables were binarized using the sample median, which ensured a balanced distribution between classes for each task. This preprocessing step minimized the risk of classification bias due to unequal class sizes. Therefore, the KNN model's limitations may be better explained by its sensitivity to high-dimensional spaces and the lack of feature selection or dimensionality reduction. Future research may consider incorporating preprocessing techniques such as Principal Component Analysis (PCA) or feature engineering strategies to improve KNN’s performance in this context.
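A minimal sketch of the PCA-based remediation for KNN mentioned above (hypothetical file/column names; the 95% variance threshold is illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

df = pd.read_csv("visual_tests.csv")                      # hypothetical file name
X, y = df.drop(columns=["af_label"]), df["af_label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

knn_pca = make_pipeline(StandardScaler(),
                        PCA(n_components=0.95),           # keep components explaining 95% of variance
                        KNeighborsClassifier(n_neighbors=5))
knn_pca.fit(X_tr, y_tr)
print(round(f1_score(y_te, knn_pca.predict(X_te), average="macro"), 3))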