Advancing Cognitive–Motor Assessment: Reliability and Validity of Virtual Reality-Based Testing in Elite Athletes
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a study evaluating the reliability and validity of VR-based cognitive-motor assessments in a large sample (over a thousand elite athletes). Participants completed ten brain fitness tests targeting four domains of cognitive-motor ability: Balance & Gait, Decision-Making, Manual Dexterity, and Memory. Performance metrics such as sway, reaction time, and accuracy were collected via motion sensors embedded in the VR devices. Composite scores for each domain and a global score were calculated and analyzed for distribution characteristics and test-retest reliability. Results showed that while the global score was normally distributed, some domain scores displayed skewness. Nevertheless, high test-retest reliability was found across all domains. The authors suggest that VR assessments can provide ecologically valid and standardized measures of cognitive-motor abilities, with potential applications in psychological and athletic performance evaluation.
This study addresses an emerging and relevant area of research by combining immersive VR technology with cognitive-motor assessment in elite athletes and provides promising evidence for the potential utility of such tools in applied settings. The manuscript is well-written. The results are particularly interesting because they are based on a very large sample of elite athletes, and the high test-retest reliability demonstrates how precise and replicable the measurements can be when using this powerful tool. Overall, my evaluation of this study is positive, and I recommend it for publication. I have only a few minor comments listed below.
Specific comments
Could this tool be used to evaluate cognitive processes related to different types of performance (not necessarily sports-related)? For instance, could it be applied to assess suitability for driving or performing specific jobs? There might be many contexts where precise measurements of cognitive performance are needed.
In my opinion, future studies should investigate the sensitivity of this tool to determine whether it can detect differences between groups (e.g., experts vs. non-experts) or within the same participants (e.g., after specific training). This could be a valuable direction for the future development of this line of research.
If you compare the left- and right-hand data of the manual dexterity scores, would it be possible to determine whether, and to what extent, a person is left- or right-handed? In future studies, it could be interesting to compare such data with results from a self-report handedness inventory (e.g., the Edinburgh Handedness Inventory).
Author Response
Response to Reviewer 1
Thank you for your positive evaluation and insightful comments on our manuscript. We appreciate your recommendation for publication and your valuable suggestions for future research, which have helped us strengthen the paper.
In response to your specific comments:
- Non-sports-related applications: We have incorporated your suggestion to consider the broader applications of this tool. In the Discussion section, we now briefly touch upon how these precise cognitive-motor measurements could be applied to other contexts, such as assessing suitability for driving or specific jobs.
- Handedness analysis: We found your suggestion to explore handedness fascinating. We conducted a new analysis comparing left- and right-hand performance in the Manual Dexterity tests. As detailed in the Discussion, the results showed that the task was sensitive enough to detect significant performance differences between dominant and non-dominant hands for both left- and right-handed participants (an illustrative sketch of this kind of paired comparison is given below). We agree that comparing these data with a self-report inventory would be a valuable direction for future research and have now mentioned this in the manuscript.
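For illustration, a minimal sketch of such a dominant- versus non-dominant-hand comparison is shown below. The variable names and simulated values are hypothetical and are not taken from the study data; the sketch only shows the form of a paired analysis of this kind.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant Manual Dexterity scores, one value per hand.
# In the study these would come from the VR task metrics; here they are simulated.
rng = np.random.default_rng(0)
dominant = rng.normal(loc=0.82, scale=0.05, size=200)                  # dominant-hand accuracy
non_dominant = dominant - rng.normal(loc=0.04, scale=0.02, size=200)   # slightly lower non-dominant accuracy

# Paired comparison: is dominant-hand performance reliably higher?
t_stat, p_value = stats.ttest_rel(dominant, non_dominant)

# Paired-samples effect size (Cohen's d on the within-participant differences).
diff = dominant - non_dominant
cohens_d = diff.mean() / diff.std(ddof=1)

print(f"paired t = {t_stat:.2f}, p = {p_value:.3g}, d = {cohens_d:.2f}")
```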
Thank you again for your constructive feedback.
Reviewer 2 Report
Comments and Suggestions for Authors
I find that the following concerns need to be addressed:
The exclusive use of elite male athletes needs to be discussed in terms of how this population-specific sampling impacts the broader applicability of the results.
The arbitrary weighting schemes adopted to calculate composite scores and the simple average of scores in the "Global CM Score" lack statistical foundations such as factor analysis and cross-validation to objectively derive score weights.
The ceiling effect observed in the decision-making process should be discussed analytically.
All the reported claims regarding the ecological validity of tests must be supported by providing external behavioral comparisons and real-world correlations.
There is no discussion of the influence of learning effects, which would be needed to justify the reported high test-retest reliability.
Author Response
Response to Reviewer 2
Thank you for your thorough review and critical feedback. We have undertaken a significant re-analysis of our data to address the important concerns you raised. We believe the manuscript is now substantially improved as a result.
Here is how we have addressed your specific points:
- Population-specific sampling: We agree that the exclusive use of elite male athletes is a key limitation. We have expanded the Discussion section to explicitly acknowledge this and to elaborate on how this impacts the broader applicability of our findings to other populations. We also propose that future research should focus on validating these assessments in more diverse groups.
- Composite score weighting: You rightly pointed out the lack of statistical foundation for our original weighting schemes. We have now replaced this with a robust, data-driven approach. As detailed in the revised "Calculation of Composite Scores" section, we performed a Confirmatory Factor Analysis (CFA) to derive empirical weights (factor loadings) for each metric. The new composite scores, including the Global CM Score, are now calculated from these statistically validated weights (an illustrative sketch of this construction is given after this list).
- Ceiling effect: We have expanded our discussion of the ceiling effect observed in the Decision-Making (DM) domain. The Discussion now provides a more detailed analysis, acknowledging that the test was likely too easy for this elite cohort and suggesting that this limited its sensitivity.
- Ecological validity and external comparisons: To better support our claims of ecological validity, we have included a new analysis in the Discussion section that demonstrates the Manual Dexterity task's sensitivity to expected real-world differences in performance between an individual's dominant and non-dominant hand.
- Learning effects and test-retest reliability: We acknowledge that our initial analysis did not consider the influence of the retest interval. To address this and potential learning effects, we have completely revised our test-retest reliability analysis. We stratified the data into two groups based on the time interval between tests (<48 hours and >48 hours). This new analysis, presented in Table 3, provides a more nuanced understanding of the measures' stability over different timeframes and shows that reliability remains high even over longer periods.
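As an illustration of the composite-score construction described above, the sketch below standardizes a set of raw metrics to Z-scores and combines them using loading-derived weights. The metric names, loading values, and data are placeholders, not the figures reported in Table 2 of the manuscript.

```python
import pandas as pd

# Hypothetical raw Balance & Gait metrics for a few athletes.
# Column names, values, and loadings are placeholders only.
raw = pd.DataFrame({
    "tandem_balance":    [12.1, 9.8, 14.3, 11.0],
    "dual_task_balance": [8.7, 7.9, 10.2, 9.1],
    "tandem_walk":       [5.3, 6.1, 4.8, 5.6],
})

# Step 1: standardize each metric to a Z-score across the sample.
z = (raw - raw.mean()) / raw.std(ddof=1)

# Step 2: weight each standardized metric by its factor loading
# (in the revised manuscript these come from the CFA) and sum into a composite.
loadings = pd.Series({"tandem_balance": 0.70, "dual_task_balance": 0.55, "tandem_walk": 0.40})
weights = loadings / loadings.sum()     # rescale loadings so the weights sum to 1
bg_score = (z * weights).sum(axis=1)    # one Balance & Gait composite per athlete

print(bg_score)
```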
We are confident that these revisions fully address your concerns and have significantly strengthened the manuscript. Thank you for your valuable contributions.
Reviewer 3 Report
Comments and Suggestions for Authors
Overall assessment. For the journal's profile, the work appears to be generally appropriate: a large sample (N = 1,213), clearly described VR tests, a transparent protocol for forming the composite indicators, presentation of the score distributions, and high test-retest reliability (ICC > 0.994). This is a significant empirical contribution to the validation of VR assessment of cognitive-motor functions in elite athletes.
Main comments:
1. The authors normalize the metrics “relative to maximums” across the entire sample and set fixed weights (e.g., BG_Score = 0.70·TB + 0.25·DTB + 0.05·TW). This can distort scales, reinforce artifacts of a particular sample, and suppress the informativeness of tasks with low adherence (TW). I recommend: (i) standardization with Z-scores for each test, (ii) justification of weights through data (PCA/CFA, IRT/Rasch) or sensitivity analysis of weights, (iii) an alternative — percentile scaling on an independent norm.
2. The authors correctly show histograms and Q-Q, but in section 3.3 there is a typo “MC Score” instead of “ME Score.” Also, if individual domains are NOT normal (Shapiro–Wilk significant), then the analysis and conclusions for them should avoid parametric assumptions or use transformations/robust methods. Add 95% CI for ICC, SEM, MDC, Bland–Altman for clarity of stability.
3. The interval “from several hours to several days” may give inflated ICCs and not reveal stability in real monitoring intervals (weeks/months). An analysis taking into account the interval is needed (correlation ICC ~ interval; paired differences Session1 vs Session2; possible learning effects).
4. Conclusion regarding novelty: the contribution is empirically significant but methodologically expected. I recommend positioning more clearly what exactly is new: (i) the scale of the sample (N), and (ii) the sensitivity of the Global CM score to differences between groups (if possible, add case studies or correlations with "external" criteria such as coaching ratings, injury history, or neurocognitive screenings).
Conclusion. The work has a good empirical basis and practical value for sports science and neuroassessment. After significant revisions, primarily in statistical reporting, composite construction, treatment of the ceiling effect in DM, and more transparent disclosure of COI/ethics, the article can be considered for publication.
Author Response
Response to Reviewer 3
Dear Reviewer,
Thank you for your detailed and highly constructive feedback. Your suggestions have guided a significant revision of our statistical methodology and reporting, which we believe has greatly improved the rigour and clarity of our work.
We have addressed your main comments as follows:
- Composite score construction: We have completely revised our methodology for calculating composite scores, moving away from the previous weighting scheme. In line with your recommendations, the new approach involves:
  - (i) Standardization: all individual test metrics were transformed and standardized into Z-scores.
  - (ii) Data-driven weights: we conducted a Confirmatory Factor Analysis (CFA) to derive factor loadings, which serve as data-driven weights for the composite scores. This process and the resulting weights are detailed in the Methods section and Table 2. This new methodology directly addresses your concerns about arbitrary weights and potential sample-specific artefacts.
- Statistical reporting and normality:
  - Typo: thank you for spotting this; "MC Score" has been corrected to "ME Score" throughout the manuscript.
  - Normality: we have conducted a new, thorough normality assessment of the final composite scores using three statistical tests and Q-Q plots, presented in Figure 1. We confirm that the Decision-Making (DM) score was not normally distributed and have highlighted this in the Abstract, Results, and Discussion sections.
  - Reliability metrics: we have updated our test-retest reliability analysis and now report 95% confidence intervals for all ICC values in Table 3. We have also conducted a Pearson's correlation analysis as a complementary measure of the relationship between test sessions. In addition, as you recommended, we have included Bland-Altman plots (Figure 2) to visually assess the agreement between test and retest sessions. These plots confirm the absence of systematic bias and show that the assessments are equally reliable across the full spectrum of athlete performance. An illustrative sketch of these reliability computations is given after this list.
- Test-retest interval and learning effects: We agree that the retest interval is a critical factor. Our new analysis of test-retest reliability now stratifies the sample into two groups based on the time interval between sessions (<48 hours and >48 hours). As shown in Table 3, this allows for a more transparent assessment of stability over time and addresses your concern about potentially inflated ICCs from short intervals. The results confirm that reliability remains high across both short and longer periods.
- Positioning novelty and external criteria: To better position the study's contribution, we have added a new analysis to the Discussion that serves as a form of external validation. We demonstrate that the Manual Dexterity tasks are sensitive enough to detect significant performance differences between a participant's dominant and non-dominant hand, reflecting an expected real-world attribute.
- COI/ethics: We have revised the manuscript to provide a more transparent disclosure of our data sharing agreement and ethical considerations. The conflict of interest statement has also been clarified.
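To show the form of these reliability computations, the sketch below estimates an ICC with its 95% CI (via the pingouin package) together with Bland-Altman bias and limits of agreement for two sessions, separately for retest intervals shorter and longer than 48 hours. The data, column names, and cut-off used here are illustrative assumptions, not the study's values.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical test-retest data in long format: one row per athlete per session.
rng = np.random.default_rng(1)
n = 120
true_score = rng.normal(100, 10, n)
df = pd.DataFrame({
    "athlete": np.repeat(np.arange(n), 2),
    "session": np.tile([1, 2], n),
    "score": np.repeat(true_score, 2) + rng.normal(0, 1.5, 2 * n),
    "interval_h": np.repeat(rng.uniform(4, 240, n), 2),   # hours between sessions
})

for is_short, subset in df.groupby(df["interval_h"] < 48):
    group = "<48 h" if is_short else ">=48 h"
    # ICC (two-way random effects) with its 95% confidence interval.
    icc = pg.intraclass_corr(data=subset, targets="athlete",
                             raters="session", ratings="score")
    icc2 = icc.loc[icc["Type"] == "ICC2"].iloc[0]
    # Bland-Altman statistics: mean difference (bias) and 95% limits of agreement.
    wide = subset.pivot(index="athlete", columns="session", values="score")
    diff = wide[2] - wide[1]
    loa = (diff.mean() - 1.96 * diff.std(ddof=1), diff.mean() + 1.96 * diff.std(ddof=1))
    print(f"{group}: ICC = {icc2['ICC']:.3f}, 95% CI = {icc2['CI95%']}, "
          f"bias = {diff.mean():.2f}, LoA = [{loa[0]:.2f}, {loa[1]:.2f}]")
```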
We believe these substantial revisions address your concerns and we thank you once again for your meticulous review, which has been invaluable in improving our paper.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The following minor concerns need to be addressed:
The number of athletes and units of analysis in terms of sessions should be aligned between the Abstract and Methods, clearly indicating how many people, sessions, and cases are included in each analysis (including CFA).
What criteria were used to remove the 67 cases after winsorizing?
It is unclear what the N reported in Table 3 refers to; in particular, it is unclear whether they refer to participants, scores, or pairs of sessions.
It is appropriate to specify whether the research was exempted from ethics review and whether the athletes involved provided consent regarding the use of their anonymized data.
Author Response
Dear Reviewer 2,
Thank you again for your time and for providing these helpful points for clarification. We have addressed each below and believe the manuscript is stronger for it.
- Alignment of athlete and session numbers: Thank you for noting this. We have now aligned the Abstract and Methods sections to ensure the participant numbers for each stage of the analysis are clear and consistent.
- Criteria for removing 67 cases: We have clarified this in the Methods section. These 67 cases were identified as outliers falling outside of a 3 standard deviation threshold and were consequently excluded from the analysis (a brief illustrative sketch of this screening step is given after this list).
- Clarification of 'N' in Table 3: You are correct to ask for clarity. The 'N' refers to the number of participants. We have updated the table caption to state this explicitly.
- Ethics review and consent: The Institutional Review Board statement has been updated to confirm that this research was exempted from a formal ethics review. The commercial contract between INCISIV and its clients states that the organisation is responsible for gaining consent from players, as data collection is part of their mandatory pre-screening and contractual obligations.
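To illustrate the kind of screening described above, the sketch below winsorizes a score distribution and then flags cases lying more than 3 SD from the winsorized mean. The winsorizing limits, the cut-off, and the simulated values are assumptions for illustration only, not the study's exact parameters.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Illustrative outlier screening; values and parameters are placeholders.
rng = np.random.default_rng(2)
scores = rng.normal(100, 15, 1280)
scores[:15] += 150                      # inject a few extreme cases

# Winsorizing step: clip the most extreme 1% in each tail.
wins_scores = np.asarray(winsorize(scores, limits=(0.01, 0.01)))

# Exclusion step: drop cases lying more than 3 SD from the winsorized mean.
mu, sd = wins_scores.mean(), wins_scores.std(ddof=1)
keep = np.abs(scores - mu) <= 3 * sd

print(f"excluded {int(np.sum(~keep))} of {len(scores)} cases")
```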
We trust these revisions have addressed your concerns and we thank you again for your valuable input.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for taking my comments into account.
Author Response
Thank you very much for your great review! We believe the manuscript is much stronger.
