An AI-Driven Trainee Performance Evaluation in XR-Based CPR Training System for Enhancing Personalized Proficiency
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
I am grateful for the opportunity to review the manuscript entitled “An AI-driven Trainee Performance Evaluation in XR-based CPR Training System for Enhancing Personalized Proficiency”. The authors present an interesting and relevant study in which they evaluate a system that uses artificial intelligence and extended reality to assess the success of cardiopulmonary resuscitation (CPR) training. The manuscript is well-written, the field of research is of paramount importance to the scientific community and falls within the scope of the journal. However, parts of the work require substantial revisions: the work is inconsistent, as the content of the introduction, discussion, and conclusions sections in particular sometimes goes far beyond what was shown in the study itself.
1.) Major issues:
It is imperative that any submitted manuscript is conclusive in itself. This means that if the study conducted is seen as the centre of the manuscript, every chapter should refer to what has been done and what has been seen. The introduction should result in a study question that can be addressed by the methods described. The methods employed should be reported in such a manner that any researcher would be able to replicate the trial and obtain the same results. The discussion provides a comprehensive contextual analysis of the results derived from the trial, also placing them in the context of relevant literature. The conclusions are drawn from the results of the conducted trial.
The manuscript should be structured according to these chapters, which are firmly established in scientific literature: Introduction – Material and Methods – Results – Discussion – Conclusion. Relevant checklists might be helpful in this regard.
Between introduction and methodology, there appears a chapter containing a selective review on AI/XR in CPR training (chapter 2). This mixed approach appears unconventional, as a selective review is an independently conducted (and published) scientific publication. A review, as with any other scientific publication, should be divided into a methodology, results, discussion, and conclusion section. The findings of this study are set into context with pre-existing literature in the discussion section (which is entirely missing). The utilisation of a checklist (as mentioned above) is an effective instrument which might help to ensure consistent reporting.
No information is provided regarding the following: obtaining approval from an independent ethics committee; registering the study in a relevant database prior to the start of data collection; method of recruiting participants; and a relevant standard or checklist for reporting on a study (e.g., CONSORT).
2.) Minor issues:
The title should only contain trivial abbreviations, other abbreviations should be avoided or at least explained.
Parts of the title are not repeated as keywords.
2.1.) Abstract:
The abstract first mentions the limitations of “group public CPR training” and subsequently concludes that “the proposed system is expected to lower barriers…”. However, it should be noted that neither of these outcomes is examined in this trial: it neither compares AI/XR training to conventional training nor examines the barriers to CPR training. The abstract should sum up what has been done and what has been observed. The introduction should lead to the study question, and the discussion/conclusion should be derived from the results achieved.
2.2.) Introduction:
The introduction is somewhat extended and should more precisely direct the reader towards the question that is under examination in this particular trial.
There is no need to go into detail about how the paper is organized at this point, as scientific publications strictly follow established structures.
It is not clear how the authors arrive at their conclusion that there are "low training outcomes and inadequate skill proficiency." The heterogeneity of skill distribution in the public is a key factor in this regard.
2.3.) Methodology:
From the technical part, the manuscript displays a systematic and appropriate design, and the illustrations are helpful in understanding the methodology.
However, the study fails to provide any information regarding the number of participants, the methodology of their recruitment, the prior medical experience they had, or the manner in which the CPR simulation scenarios were conducted. Furthermore, the absence of any explicit statement regarding the acquisition of ethical approval for this study is a notable deficiency. In the opinion of the reviewer, ethical approval is a prerequisite for a clinical study, even when the subjects are healthy.
It is not necessary to provide an explanation for the content of any given section, as the structures of scientific literature are well-established and widely accepted.
Based on the description of the methodology, it can be assumed that this refers to the training of lay rescuers in CPR (Basic Life Support, BLS) and not professional rescuers (Advanced (Cardiac) Life Support, ALS/ACLS). This assumption should be explicitly stated in the manuscript. Consequently, the utilisation of “EMT Level 2 certification protocols” as a basis for verification tests has to be questioned, as these protocols are developed and validated for professional rescuers.
It is unclear how the authors arrived at a "critical 4-minute interval", as this is not included in the CPR guidelines, i.e. the joint statement from the International Liaison Committee on Resuscitation (ILCOR) or the regional guidelines, e.g. from the American Heart Association (AHA) or the European Resuscitation Council (ERC), respectively. These should also be the primary source for any definitions (e.g., “OHCA”).
It is unclear how the authors arrived at a "critical 4-minute interval" stated in the methods section, as this is not mentioned in the aforementioned CPR guidelines.
The characteristics of the language models examined and the reasons for their use should be presented in more detail.
2.4.) Results:
The results are clearly structured, comprehensively reported and presented in an adequate manner. However, the results (as well as the methodology, see above) of the data collection from the probands during the CPR simulation scenarios are missing.
2.5.) Discussion:
As stated above, the discussion should set the methods used into context with the literature. Although the discussion section is completely missing, the work (as in many other areas of life) takes the tone that these new and hyped technologies are much better than anything conventional that has come before. The general consensus that these technologies are likely to revolutionize our lives is unquestionably correct. With regard to the trial, in reality, the quality of training varies considerably. However, it is important to consider the type, mode, and duration of the training with which the technology utilized in this study is compared. In particular, the study design is not suitable for determining whether one type of training is “better” or “worse” than another. Due to the many confounders, a very large randomized controlled study involving an extremely high number of subjects would be necessary in order to be able to detect the effect of training on survival rates. Consequently, the study results should be interpreted with some caution, taking into account the limitations of an artificial exercise.
The differences/distinctions between large language models (LLMs) and artificial intelligence should be highlighted at this point. It is important to note that the responses generated by ChatGPT and similar models do not necessarily reflect true information, but rather information that is highly likely to match the previously used word sequences. This is a potentially problematic aspect of LLMs, in that they produce responses that are aligned with the expectations and beliefs of the user, irrespective of their veracity. This issue must be addressed in the discussion, as “classic” CPR trainers are trained to evaluate participants' contributions critically and to provide (even critical) feedback.
Despite the fact that "old" CPR guidelines were still in effect at the time the study was conducted, the highly topical nature of the recent revision of the guidelines from October 2025 justifies a brief discussion of the relevant changes and statements in the "new" guidelines with regard to the study conducted.
2.6.) Conclusions:
The conclusions drawn are not necessarily derived from the methodology and results of the experiments conducted.
2.7.) Disclosures:
Were there no clinicians involved in the study? How is medical expertise ensured?
2.8.) Literature
References are properly cited; the missing relevant literature (CPR guidelines) is noted above.
3.) Formatting and Language:
The submitted manuscript is well formatted, and the figures and tables are correctly placed. Minor errors can be addressed during the publication process.
4.) Miscellaneous:
If the documentation or data collection required to facilitate an adequate understanding of the manuscript exceeds the scope of the latter, it is possible to attach it as supplementary files.
The plagiarism check revealed no issues.
5.) Scientific merit
The study itself and its results are highly relevant for the scientific community, and the field of research is highly relevant for a technical (or medical) journal.
6.) Ethics
No issues.
7.) Recommendation
After revision of the content of the introduction, discussion, and conclusions, I see the potential for publication.
Author Response
We would like to express our sincere gratitude to the reviewer for the comprehensive review and valuable insights. We particularly appreciate the reviewer's recognition of the study's relevance to the scientific community. We acknowledge the reviewer's concern regarding the manuscript's structure and the consistency between the introduction/conclusion and the actual study results. We realized that our initial manuscript did not clearly distinguish between the "proposed educational goals" and the "technical verification" actually conducted. Accordingly, we have substantially revised the manuscript to clarify that this study is a technical Proof-of-Concept (PoC) using manikins to validate the system architecture, rather than a clinical trial involving human participants. We have also restructured the sections to follow standard scientific reporting.
Please find our point-by-point responses below.
Comments 1: It is imperative that any submitted manuscript is conclusive in itself. This means that if the study conducted is seen as the centre of the manuscript, every chapter should refer to what has been done and what has been seen. The introduction should result in a study question that can be addressed by the methods described. The methods employed should be reported in such a manner that any researcher would be able to replicate the trial and obtain the same results. The discussion provides a comprehensive contextual analysis of the results derived from the trial, also placing them in the context of relevant literature. The conclusions are drawn from the results of the conducted trial.
Response 1: We fully agree with the reviewer. We have revised Section 1 (Introduction) and Section 6 (Conclusions) to ensure strict consistency.
Introduction: We narrowed the scope from broad educational claims to specific technical challenges (e.g., lack of biomechanical feedback in sensors, latency in AR). The study question now focuses on "whether an AI-integrated XR system can provide real-time, accurate feedback on CPR protocols without human instructors."
Conclusions: We have rewritten the conclusion to strictly reflect the experimental results.
Comments 2: The manuscript should be structured according to these chapters, which are firmly established in scientific literature: Introduction – Material and Methods – Results – Discussion – Conclusion. Relevant checklists might be helpful in this regard.
Response 2: We have restructured the manuscript to align with this standard format. Specifically, we have added a dedicated Section 5. Discussion, which was missing in the original draft. In this section, we interpret our results in the context of existing literature and discuss the limitations of our study.
Comments 3: Between introduction and methodology, there appears a chapter containing a selective review on AI/XR in CPR training (chapter 2). This mixed approach appears unconventional, as a selective review is an independently conducted (and published) scientific publication. A review, as with any other scientific publication, should be divided into a methodology, results, discussion, and conclusion section. The findings of this study are set into context with pre-existing literature in the discussion section (which is entirely missing). The utilisation of a checklist (as mentioned above) is an effective instrument which might help to ensure consistent reporting.
Response 3: We understand the reviewer's perspective, which aligns with standard medical reporting. However, in engineering and computer science journals (such as MDPI Electronics), a dedicated "Related Works" section (Section 2) is often required to position the proposed technical architecture against existing technologies before detailing the methodology. Nevertheless, we agree that the Discussion was missing. We have now clearly separated the Related Works (Section 2) from the Discussion (Section 5). Section 2 focuses on the technical limitations of previous studies, while Section 5 interprets our findings and discusses clinical implications.
Comments 4: No information is provided regarding the following: obtaining approval from an independent ethics committee; registering the study in a relevant database prior to the start of data collection; method of recruiting participants; and a relevant standard or checklist for reporting on a study (e.g., CONSORT).
Response 4: This is a critical point that requires clarification. This study was conducted as a technical Proof-of-Concept (PoC) using manikins and internal developers to verify system functionality, latency, and sensor accuracy. It was NOT a clinical trial involving human participants or patients. Therefore, participant recruitment, IRB approval, and clinical trial registration were not applicable. We have added an explicit statement in Section 4 (Experiments and Results) and Section 5 (Discussion) clarifying that the validation was performed in a controlled laboratory environment using manikins, and that future clinical trials with human subjects will require ethical approval.
Comments 5: The title should only contain trivial abbreviations, other abbreviations should be avoided or at least explained. Parts of the title are not repeated as keywords.
Response 5: We have revised the keywords to align with the title.
Comments 6:
2.1.) Abstract:
The abstract first mentions the limitations of “group public CPR training” and subsequently concludes that “the proposed system is expected to lower barriers…”. However, it should be noted that neither of these outcomes is examined in this trial: it neither compares AI/XR training to conventional training nor examines the barriers to CPR training. The abstract should sum up what has been done and what has been observed. The introduction should lead to the study question, and the discussion/conclusion should be derived from the results achieved.
Response 6: We have revised the Abstract to focus on the technical contributions and quantitative results. Speculative statements about lowering barriers have been removed.
Comments 7:
2.2.) Introduction:
The introduction is somewhat extended and should more precisely direct the reader towards the question that is under examination in this particular trial.
There is no need to go into detail about how the paper is organized at this point, as scientific publications strictly follow established structures.
Response 7: We accept the reviewer’s feedback. We have streamlined the Introduction (Section 1) to be more concise. We removed the paragraph detailing the organizational structure of the paper. We tightened the text to focus directly on the research question: "Developing a system that ensures biomechanical precision and real-time protocol feedback without human instructor."
Comments 8: (2.2.) Introduction) It is not clear how the authors arrive at their conclusion that there are "low training outcomes and inadequate skill proficiency." The heterogeneity of skill distribution in the public is a key factor in this regard.
Response 8: We acknowledge that the phrase "low training outcomes" was ambiguous. We have revised the introduction to cite specific literature regarding the "rapid decay of CPR skills" and "limitations of non-feedback based training" rather than making generalized claims about public proficiency.
Comments 9:
(2.3.) Methodology) From the technical part, the manuscript displays a systematic and appropriate design, and the illustrations are helpful in understanding the methodology.
However, the study fails to provide any information regarding the number of participants, the methodology of their recruitment, the prior medical experience they had, or the manner in which the CPR simulation scenarios were conducted. Furthermore, the absence of any explicit statement regarding the acquisition of ethical approval for this study is a notable deficiency. In the opinion of the reviewer, ethical approval is a prerequisite for a clinical study, even when the subjects are healthy.
It is not necessary to provide an explanation for the content of any given section, as the structures of scientific literature are well-established and widely accepted.
Response 9: As emphasized in our response to the Major Issues, this study was conducted as a technical Proof-of-Concept (PoC) to validate system latency, sensor accuracy, and AI logic, utilizing manikins in a laboratory setting. No external human subjects or patients were recruited for clinical trials. The "experiments" involved the authors/developers performing standard actions to verify whether the sensors and AI correctly detected them. We have explicitly stated in Section 4 (Experiments and Results) and Section 5 (Discussions) that this was an internal functional test, and that future clinical trials involving human subjects will require IRB approval and formal recruitment protocols.
Comments 10: (2.3.) Methodology) Based on the description of the methodology, it can be assumed that this refers to the training of lay rescuers in CPR (Basic Life Support, BLS) and not professional rescuers (Advanced (Cardiac) Life Support, ALS/ACLS). This assumption should be explicitly stated in the manuscript. Consequently, the utilisation of “EMT Level 2 certification protocols” as a basis for verification tests has to be questioned, as these protocols are developed and validated for professional rescuers.
Response 10: We confirm that the target audience is indeed laypeople. We removed "EMT Level 2". We utilize basic life support and CPR certification protocols based on ILCOR/AHA/KACPR guidelines.
Comments 11: (2.3.) Methodology)
It is unclear how the authors arrived at a "critical 4-minute interval", as this is not included in the CPR guidelines, i.e. the joint statement from the International Liaison Committee on Resuscitation (ILCOR) or the regional guidelines, e.g. from the American Heart Association (AHA) or the European Resuscitation Council (ERC), respectively. These should also be the primary source for any definitions (e.g., “OHCA”).
It is unclear how the authors arrived at a "critical 4-minute interval" stated in the methods section, as this is not mentioned in the aforementioned CPR guidelines.
Response 11: We appreciate the correction. We have removed the specific "4-minute" claim to avoid confusion with official guidelines. In Section 1, we have rephrased this to emphasize the well-established "time-dependent" nature of survival rates, citing that the probability of survival decreases significantly for every minute treatment is delayed, without stating a rigid 4-minute cutoff.
Comments 12: The characteristics of the language models examined and the reasons for their use should be presented in more detail.
Response 12: We have expanded the description of the LLMs in Section 4.
Comments 13:
2.4.) Results:
The results are clearly structured, comprehensively reported and presented in an adequate manner. However, the results (as well as the methodology, see above) of the data collection from the probands during the CPR simulation scenarios are missing.
Response 13: As clarified in the Methodology section, there were no "probands" (clinical subjects) in this PoC study.
Comments 14:
2.5.) Discussion:
As stated above, the discussion should set the methods used into context with the literature. Although the discussion section is completely missing, the work (as in many other areas of life) takes the tone that these new and hyped technologies are much better than anything conventional that has come before. The general consensus that these technologies are likely to revolutionize our lives is unquestionably correct. With regard to the trial, in reality, the quality of training varies considerably. However, it is important to consider the type, mode, and duration of the training with which the technology utilized in this study is compared. In particular, the study design is not suitable for determining whether one type of training is “better” or “worse” than another. Due to the many confounders, a very large randomized controlled study involving an extremely high number of subjects would be necessary in order to be able to detect the effect of training on survival rates. Consequently, the study results should be interpreted with some caution, taking into account the limitations of an artificial exercise.
Response 14: We have added a comprehensive Section 5 (Discussions) to address this. We explicitly discuss that while AI/XR offers accessibility, it cannot yet fully replicate the critical judgment of a human instructor. We positioned our system as an "auxiliary tool" to support self-training before final CPR certification with a human instructor, rather than a replacement for professional certification.
Comments 15: (2.5.) Discussion) The differences/distinctions between large language models (LLMs) and artificial intelligence should be highlighted at this point. It is important to note that the responses generated by ChatGPT and similar models do not necessarily reflect true information, but rather information that is highly likely to match the previously used word sequences. This is a potentially problematic aspect of LLMs, in that they produce responses that are aligned with the expectations and beliefs of the user, irrespective of their veracity. This issue must be addressed in the discussion, as “classic” CPR trainers are trained to evaluate participants' contributions critically and to provide (even critical) feedback.
Response 15: This is a critical insight. We addressed the safety of LLM-based feedback in Section 4.2 and Section 5.3. We explained that the system does not allow the LLM to generate open-ended answers. Instead, we use a strict System Prompt that restricts the AI to a "Validator" role, checking for specific Keyword Matching (e.g., "AED", "119") and Contextual Appropriateness based strictly on the KACPR guidelines. This minimizes the risk of hallucinations.
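To make the keyword-matching idea in Response 15 concrete, the following is a minimal Python sketch of the deterministic part of such a "Validator" check. It is an illustration only and not the authors' implementation: the step names, the keyword table, and the function `validate_utterance` are hypothetical, and the LLM-based "Contextual Appropriateness" check governed by the system prompt is not modeled here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical mapping from protocol step to required keywords.
# The response only names "AED" and "119" as examples; the step
# names below are assumptions made for illustration.
REQUIRED_KEYWORDS = {
    "call_for_help": ["119"],
    "request_aed": ["AED"],
}


@dataclass
class ValidationResult:
    step: str
    passed: bool
    missing: list = field(default_factory=list)


def validate_utterance(step: str, transcript: str) -> ValidationResult:
    """Check that every keyword required for `step` occurs in the transcript.

    Matching is case-insensitive and requires non-word boundaries on both
    sides, so "1190" does not count as "119".
    """
    normalized = transcript.upper()
    missing = []
    for keyword in REQUIRED_KEYWORDS[step]:
        pattern = rf"(?<!\w){re.escape(keyword.upper())}(?!\w)"
        if not re.search(pattern, normalized):
            missing.append(keyword)
    return ValidationResult(step=step, passed=not missing, missing=missing)


if __name__ == "__main__":
    print(validate_utterance("call_for_help", "You, please call 119 now!"))
    print(validate_utterance("request_aed", "Somebody help me"))
```

A closed check of this kind can only return pass or fail against a fixed list, which is one way to keep the feedback path from producing open-ended text; whether the authors combine it with the LLM in this way is not stated in the response.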
Comments 16: (2.5.) Discussion) Despite the fact that "old" CPR guidelines were still in effect at the time the study was conducted, the highly topical nature of the recent revision of the guidelines from October 2025 justifies a brief discussion of the relevant changes and statements in the "new" guidelines with regard to the study conducted.
Response 16: We have updated the manuscript to reflect the latest trends. In Section 1 (Introduction), we referenced the 2025 ILCOR and AHA guidelines, noting their emphasis on "digital transformation" and "AR and In-situ simulation" in resuscitation education. We used this to validate the relevance of our research direction.
Comments 17: (2.6.) Conclusions) The conclusions drawn are not necessarily derived from the methodology and results of the experiments conducted.
Response 17: We have revised Section 6. Conclusions. We removed speculative claims about educational outcomes that were not tested.
Comments 18:
2.7.) Disclosures: Were there no clinicians involved in the study? How is medical expertise ensured?
Response 18: While the authors are engineering researchers, the system's logic and evaluation criteria were rigorously programmed based on the KACPR Guidelines. We adhered to these published medical standards to ensure clinical relevance. We have acknowledged in Section 5 that future validation by professional medical instructors is a necessary next step.
Comments 19:
2.8.) Literature: References are properly cited; the missing relevant literature (CPR guidelines) is noted above.
Response 19: We have added the 2025 ILCOR/AHA Guidelines and the 2020 KACPR Guidelines to the reference list and cited them throughout the manuscript to support our definitions and experimental design.
Comments 20: 3.) Formatting and Language: The submitted manuscript is well formatted, and the figures and tables are correctly placed. Minor errors can be addressed during the publication process.
Response 20: We thank the reviewer for the positive feedback regarding the manuscript’s formatting. We have conducted a final, thorough proofreading of the revised manuscript to correct any remaining grammatical errors and ensure professional language quality throughout the text.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper describes an interesting system combining AI with XR (extended reality) for CPR (cardiopulmonary resuscitation) training that uses sensors, computer vision, and an LLM (large language model) for voice feedback. The design has potential for improving personalization for CPR training. The architecture review is good, and the results seem generally positive.
Nonetheless, several things need to be adjusted to improve the clarity and scientific rigor of the paper: Explain the setup of your experiments more. How many sessions did you have? How many repetitions? What kind of stats did you use for the results you reported?
Most of the evaluation is done with mannequins. This limitation is accepted and should be stated and discussed. It would be nice to have a quantitative comparison (however concise) to existing XR or AI based CPR training systems to better position the work.
Because training for CPR is safety-critical, how is the reliability and safety of the LLM based verbal feedback?
Figure clarity and language should be improved. In summary, the paper can be accepted with the mentioned minor changes.
Author Response
Comments 1: Nonetheless, several things need to be adjusted to improve the clarity and scientific rigor of the paper: Explain the setup of your experiments more. How many sessions did you have? How many repetitions? What kind of stats did you use for the results you reported?
Response 1: We sincerely thank the reviewer for pointing out the need for enhanced clarity regarding the experimental setup. We have revised Section 4 (Experiments and Results) to explicitly clarify the nature of our evaluation. As this study focuses on the Proof-of-Concept (PoC) and technical feasibility of the proposed system architecture rather than a clinical user study, our experiments were designed to verify system performance metrics (latency, recognition accuracy, and sensor precision).
Comments 2: Most of the evaluation is done with mannequins. This limitation is accepted and should be stated and discussed. It would be nice to have a quantitative comparison (however concise) to existing XR or AI based CPR training systems to better position the work.
Response 2: We agree with the reviewer that relying solely on mannequins is a limitation. We have explicitly addressed this in the Discussion and Conclusions sections. We clarified that while the system was validated in a controlled environment using a manikin, results may differ in complex, unpredictable clinical settings. We also added that further validation by skilled professional CPR instructors and potential clinical trials are necessary for future practical application. In addition, we added a new comparison of our proposed AI-XR system with existing "Sensor-based Systems" and "Conventional XR Systems" across five key dimensions: Physical Feedback, Verbal Evaluation, Tracking Method, Real-time Latency, and Instructor Need. We highlighted that unlike existing systems that focus on either just quantitative metrics or visual immersion, our system provides a comprehensive solution covering biomechanical posture correction and context-aware verbal assessment.
Comments 3: Because training for CPR is safety-critical, how is the reliability and safety of the LLM based verbal feedback?
Response 3: The reviewer raises a crucial point regarding safety. We have reinforced the explanation of our proposed system in Section 4.2 to address reliability. The system does not generate open-ended medical advice but evaluates specific "Keyword Matching" (e.g., CPR, AED) and "Contextual Appropriateness" based on fixed protocols in accordance with Korean CPR certification protocols. Furthermore, our experimental results show that the GPT-4o model achieved 88% accuracy in this controlled evaluation, confirming its feasibility as an auxiliary CPR training tool.
Comments 4: Figure clarity and language should be improved. In summary, the paper can be accepted with the mentioned minor changes.
Response 4:
We have reviewed and improved the quality of all figures to ensure the system architecture and workflows are clearly visible. Additionally, the entire manuscript has been carefully proofread to improve flow, clarity, and grammatical accuracy.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you very much for the opportunity to review the revised manuscript. The authors have put a great deal of effort into thoroughly revising the manuscript, which could significantly improve the quality of the publication. All of my suggestions have been incorporated or rejected with justification. I recommend that the manuscript be accepted.
