Peer-Review Record

Differential Effects of Desktop and Immersive Virtual Reality on Learning, Cognitive Load and Attitudes of University Students

Appl. Sci. 2026, 16(7), 3595; https://doi.org/10.3390/app16073595
by Julio Cabero-Almenara, Mª Victoria Fernández-Scagliusi, Antonio Palacios-Rodríguez * and Rocío Piñero-Virué
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 16 February 2026 / Revised: 23 March 2026 / Accepted: 26 March 2026 / Published: 7 April 2026
(This article belongs to the Special Issue Advanced Technologies Applied in Digital Media Era)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

First, I would like to thank the journal editor for the opportunity to review this manuscript.


This manuscript investigates whether interactive virtual reality (VR) learning objects can facilitate knowledge acquisition, and compares two presentation modalities, immersive (HMD) and desktop, in terms of learning performance, cognitive load, and learning attitudes. The study employed a randomized pretest–posttest experimental design with 136 education students (immersive n = 70; desktop n = 66). Primary outcome measures comprised a learning achievement test, a modified NASA-TLX cognitive load scale, and a semantic-differential-based attitude scale. Results indicate that both groups showed significant improvements in learning outcomes, with no statistically significant difference in achievement between the immersive and desktop conditions; that subjective cognitive load fell in the low-to-medium range and was slightly higher in the immersive condition than in the desktop condition; and that both groups reported very positive attitudes toward VR. The authors therefore conclude that both modalities can serve as effective and acceptable instructional tools in higher education, while immersive VR may require additional cognitive investment. They call for further work to clarify moderating factors and to develop theory-driven instructional design frameworks for VR.

 

Revision suggestions

1. The manuscript presents most analyses in table form and contains relatively few graphical displays. I recommend replacing some tables with boxplots or estimation plots (means with 95% CIs) to facilitate readers’ assessment of effect direction and uncertainty.

2. The description of the participant sample is overly brief. Please provide detailed participant demographics (sex, age distribution, prior VR experience/use frequency, and any visual or vestibular/motion-sickness conditions) and discuss how these characteristics might influence the results.

3. Include an a priori or post hoc power analysis in the Methods or Appendix to clarify whether the sample size (n = 136) is sufficient to detect the effect sizes of theoretical or practical interest.

4. The 25 multiple-choice items used for the learning test are insufficiently described. Please report the source or item-development procedure, provide item-level statistics (item difficulty and discrimination), and report overall test reliability. Explain how the test content maps to the instructional objectives.
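As a rough illustration of the power analysis requested in point 3, the sketch below uses the normal approximation for a two-sided comparison of two independent means. It is not the authors' analysis: the function names are hypothetical, the target effect size d = 0.5 is an assumed example value, and the approximation ignores both the small-sample t correction and any precision gained from the pretest covariate. Only the group sizes (70 and 66) come from the manuscript summary above.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a standardized difference d."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


def min_detectable_d(n1, n2, alpha=0.05, power=0.80):
    """Smallest standardized difference detectable with the given group sizes."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(1 / n1 + 1 / n2)


# Assumed target effect (d = 0.5) and the reported group sizes (70 and 66).
print(n_per_group(0.5))                     # 63
print(round(min_detectable_d(70, 66), 2))   # 0.48
```

Under these assumptions, the reported sample would be adequate only for between-group differences of roughly medium size or larger; an exact calculation based on the t distribution would give slightly more conservative figures.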


Recommend publication after revision.

Author Response

Please see the attached file for your reference.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors correctly identified the types of virtual reality used in the teaching-learning process: fully immersive, semi-immersive, and non-immersive (desktop). Continuing the analysis of the current state, the authors set aside the semi-immersive type as being of little significance and focused on the other two. The analysis then used criteria such as usability, accessibility, simplicity, degree of complexity, motivation, visual fatigue, and performance acquisition. The influence of cognitive load on the success of the learning process for the two types is treated in a separate sub-chapter.

The research objectives are formulated in line with the gaps identified during the analysis of the state of the art. The objectives are essentially the following research questions:

  • Do learning objects made in VR format (immersive or desktop) favour the acquisition of knowledge?
  • What cognitive load is associated with learning using VR educational materials (deployed in immersive and desktop form)?
  • What attitudes does learning using virtual reality elicit (deployed in immersive and desktop form)?

The text of “2.3. Research design” needs a little refinement in how it describes the chronology. The pre-test of students in the immersive condition should be described first, then the post-test, followed by the pre-test of students in the desktop condition… If the reviewer has misunderstood, that is itself an indication that the refinement is needed.

One of the information collection instruments was “a multiple-choice test was constructed consisting of 25 items of four answer options with one valid choice”, intended to measure performance. It would be beneficial to present the test in an appendix or as a fragment in sub-chapter 2.4 to allow the reader to better understand what was done. The same recommendation applies to the ad hoc instrument created for the attitude analysis; alternatively, insert a reference to Table 3.

The analysis of cognitive load was carried out using the NASA-TLX questionnaire, a reliable and widely used tool that is well known to specialists.

Because readers vary in their degree of focus, I suggest (but do not formally recommend) considering the removal of the first two word clouds (total) in Figure 2. I also recommend describing how the word clouds were made (instruments, etc.).

All the results presented in the manuscript were tested with the proper statistical instruments and the output was interpreted correctly, so the conclusions are based on reliable information. The conclusions are also reported in relation to the research of other authors. The limitations of the present research are indicated. The practical relevance of the present research is properly indicated.

The text of the manuscript is fluent, well-structured, and clear almost throughout. (An improvement was suggested above.)

The references are relevant to the field of research and most of them are quite recent (from the 2020s). There are 6 self-citations, but they are justified and also indicate the competence of the authors in the subject.

The figures and tables are clear, easy to understand, properly show the data, and reflect the research. Just one recommendation: for better readability, please widen the columns of Table 4.

Author Response

Please see the attached file for your reference.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript addresses a timely and relevant topic, contributing to the ongoing discussion regarding the comparative effects of immersive and desktop virtual reality in higher education. The literature review is current and generally comprehensive, and the study design is appropriate for the research questions posed. However, substantial revisions are required before the manuscript can be considered for publication.

First, the abstract would benefit from more moderate phrasing (e.g., avoiding strong claims such as “difficult to recreate”) and slightly more cautious interpretations of findings. Throughout the manuscript, significant improvement in grammar, syntax, and overall language quality is necessary, as numerous minor errors (e.g., unnecessary punctuation, double spaces, inconsistent phrasing) and weak transitions affect readability. The flow between paragraphs is often fragmented, and several sections (particularly within the Introduction) read as loosely connected summaries of prior studies rather than a coherent theoretical narrative.

Structurally, the Introduction should be condensed and conclude clearly with the study’s general aim, while current subsections (1.1, 1.2, 1.3) could be reorganized into a separate Theoretical Framework section.

In the Methods section (currently titled “The Investigation”), additional detail regarding the sample is needed, and section 2.4 requires clearer writing and stronger coherence.

The Results section is statistically rigorous and well-executed. However, Tables 2, 3, and 4 are excessively large and could be summarized in text or moved to an appendix. Similarly, Figure 2 could be resized or relocated.

Finally, Section 4 should be reorganized into a clearer Discussion (including limitations and future research directions), followed by a concise Conclusions section summarizing the main findings and implications.

While the study is methodologically sound and responds to its research questions effectively, these structural, linguistic, and presentation issues warrant major revision.

Comments on the Quality of English Language

The manuscript requires substantial language revision. While the overall meaning is generally understandable, there are numerous grammatical errors, awkward sentence constructions, punctuation inconsistencies, and formatting issues (e.g., unnecessary full stops, double spaces, inconsistent phrasing) throughout the text. In several sections, particularly in the Introduction and Methodology, the flow of ideas is fragmented and transitions between paragraphs are weak, which affects clarity and readability. Certain sentences are overly long or syntactically imprecise, and some expressions would benefit from more moderate academic phrasing.

I strongly recommend a thorough professional English language editing process to improve grammar, coherence, academic tone, and overall readability before the manuscript can be considered for publication.

Author Response

Please see the attached file for your reference.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This work represents a significant tool for education professionals who want to improve the learning process through the use of immersive reality. The main goal, interaction with VR-based learning objects, shows the acquisition of knowledge by students. The tests applied to students interacting with objects in VR in an immersive or desktop way are well represented by data tables and images.

Author Response

Thank you

Reviewer 5 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for submitting your manuscript for review. The study addresses a relevant comparison between immersive VR and desktop (non-immersive) VR in terms of performance, cognitive load, and attitudes, using a pretest–posttest design and inferential analyses including ANCOVA. The topic is timely for applied research in educational technology. However, the current version contains methodological and reporting issues that prevent a full audit of the evidence and limit reproducibility. The comments below prioritize changes that will strengthen internal validity, analytical transparency, and verifiability.

Critical comments

[Critical] 1) Intervention and procedure are not described at a reproducible level

  • Finding: The manuscript does not allow readers to reconstruct precisely what participants did, under which instructions, and for how long in each condition.
  • Location: Section 2.3 Research design (p. 4, ~lines 156–163): design and hardware are reported, but the learning object, tasks, timing, and session protocol are not.
  • Impact: Without standardization (time-on-task, scaffolding, completion criteria), confounding cannot be ruled out and both internal validity and reproducibility are weakened.
  • Improvement: Add an “Intervention and procedure” subsection detailing: learning objectives, task script, duration per phase, exact instructions, time control, allowed assistance, room conditions, and incident logs (technical failures, dropouts, cybersickness).

[Critical] 2) Sample is under-characterized; baseline variables and inclusion/exclusion criteria are missing

  • Finding: n=136 and random assignment (66/70) are reported, but baseline characteristics and eligibility criteria are not.
  • Location: Section 2.2 The research sample (p. 4, ~lines 152–155) and 2.3 (p. 4, ~line 156).
  • Impact: Without baseline profiling (age, gender, prior VR/gaming experience, cybersickness susceptibility, vision), group comparability and external validity cannot be evaluated.
  • Improvement: Add a baseline table by group and report inclusion/exclusion criteria, attrition, and a simple participant flow.

[Critical] 3) Performance test (25 MCQs) lacks psychometric evidence and is vulnerable to practice effects

  • Finding: A 25-item MCQ test is used and only the item order is changed between pre/post; reliability and item properties are not reported.
  • Location: Section 2.4 Instrument (p. 4, ~lines 168–172).
  • Impact: If items are identical, changing order does not remove practice/memory effects, threatening the interpretation of learning gains.
  • Improvement: Clarify whether parallel forms were used; report KR-20/α, item difficulty/discrimination, and content validity (item–objective mapping).
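The item statistics named in this comment could be computed along the following lines. This is a minimal sketch on a tiny hypothetical response matrix (rows are examinees, columns are items scored 0/1), not the authors' data or procedure. Discrimination is taken here as the corrected item-rest correlation, which is one common choice among several, and KR-20 uses population variances throughout; the function names are illustrative.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_analysis(responses):
    """Return (difficulty, discrimination, KR-20) for a 0/1 response matrix."""
    k = len(responses[0])                      # number of items
    totals = [sum(row) for row in responses]   # total score per examinee
    difficulty, discrimination = [], []
    for j in range(k):
        item = [row[j] for row in responses]
        rest = [t - i for t, i in zip(totals, item)]   # total excluding item j
        difficulty.append(mean(item))                  # proportion correct
        discrimination.append(pearson(item, rest))     # item-rest correlation
    pq = sum(p * (1 - p) for p in difficulty)          # sum of item variances
    kr20 = (k / (k - 1)) * (1 - pq / pvariance(totals))
    return difficulty, discrimination, kr20


# Hypothetical responses: 4 examinees x 3 items (the real test has 25 items).
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
difficulty, discrimination, kr20 = item_analysis(data)
print(difficulty, [round(r, 2) for r in discrimination], round(kr20, 2))
```

Items that every examinee answers correctly (or incorrectly) have zero variance and would need to be handled separately before computing the correlation.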

[Critical] 4) NASA-TLX reporting inconsistency: incorrect dimension labeling and unclear global score computation

  • Finding: In Table 2, the first item is labeled “Physical demand” but the description corresponds to mental/perceptual activity; the raw vs weighted TLX and “Global” computation are not specified.
  • Location: NASA-TLX description (p. 5, ~lines 173–187) and Table 2 (p. 5–6).
  • Impact: This threatens construct validity and makes cross-condition comparisons ambiguous.
  • Improvement: Correct dimension labels; explicitly state the TLX variant (raw vs weighted) and provide the “Global” scoring formula.
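For readers unfamiliar with the distinction, the two usual NASA-TLX global scores can be sketched as follows. The raw variant is the unweighted mean of the six subscale ratings; the weighted variant multiplies each rating by the number of times that subscale was chosen in the 15 pairwise comparisons and divides by 15. The ratings and weights below are hypothetical, and the rating range of the modified scale used in the manuscript is not assumed here.

```python
SUBSCALES = ("mental", "physical", "temporal",
             "performance", "effort", "frustration")


def raw_tlx(ratings):
    """Raw TLX: unweighted mean of the six subscale ratings."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)


def weighted_tlx(ratings, weights):
    """Weighted TLX: ratings weighted by pairwise-comparison tallies (sum = 15)."""
    if sum(weights[s] for s in SUBSCALES) != 15:
        raise ValueError("weights must sum to 15, one tally per pairwise comparison")
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15


# Hypothetical ratings and tallies for a single participant.
ratings = {"mental": 6, "physical": 2, "temporal": 4,
           "performance": 3, "effort": 5, "frustration": 4}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}

print(raw_tlx(ratings))                  # 4.0
print(weighted_tlx(ratings, weights))    # 4.8
```

Because the two variants can yield different global scores for the same ratings, stating which one was used matters for any cross-condition comparison.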

[Critical] 5) Table 3 (semantic differential) is not auditable: column structure is incorrect and items are duplicated

  • Finding: The table shows inconsistent headers (multiple “Standard Deviation”) and repeated/duplicated rows; immersive vs non-immersive columns are unclear.
  • Location: Table 3 (p. 6–7).
  • Impact: Results cannot be verified, undermining credibility of the attitude analysis.
  • Improvement: Rebuild Table 3 with a fixed structure: item + mean/SD per condition + Δ (and preferably CI95%); remove duplicates and list the 26 final pairs once.

[Critical] 6) ANCOVA reporting is incomplete and contains a visible editorial error

  • Finding: Table 4 contains the label “Good luck” and lacks essential components (error df, SS/MS, effect size, assumption checks).
  • Location: Table 4 (p. 9, ~lines 245–248).
  • Impact: The adequacy of the model and inference cannot be evaluated; the editorial artifact damages formal trust.
  • Improvement: Replace Table 4 with a standard ANCOVA table (Source, SS, df, MS, F, p, partial η²) and report ANCOVA assumptions (homogeneity of regression slopes, residual diagnostics, homoscedasticity).
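The components of such a table can be illustrated with a minimal one-covariate ANCOVA written from the classical sums-of-squares formulas. This is a sketch under stated assumptions, not the authors' model: the scores are hypothetical, the function names are illustrative, and p values are omitted because they require an F distribution from a statistics package. The sketch also returns the F statistic for homogeneity of regression slopes, obtained by comparing the common-slope error with the error from separate within-group slopes.

```python
from statistics import mean


def _sums(x, y):
    """Return (Sxx, Sxy, Syy) about the means of x and y."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy


def ancova(groups):
    """groups maps a label to (pretest_scores, posttest_scores)."""
    k = len(groups)
    n = sum(len(pre) for pre, _ in groups.values())
    per_group = [_sums(pre, post) for pre, post in groups.values()]
    sxx_w = sum(s[0] for s in per_group)   # pooled within-group sums
    sxy_w = sum(s[1] for s in per_group)
    syy_w = sum(s[2] for s in per_group)
    all_pre = [v for pre, _ in groups.values() for v in pre]
    all_post = [v for _, post in groups.values() for v in post]
    sxx_t, sxy_t, syy_t = _sums(all_pre, all_post)

    ss_error = syy_w - sxy_w ** 2 / sxx_w            # common-slope residual
    df_error = n - k - 1
    ms_error = ss_error / df_error
    ss_cov = sxy_w ** 2 / sxx_w                      # covariate, adjusted for group
    ss_group = (syy_t - sxy_t ** 2 / sxx_t) - ss_error   # group, adjusted for covariate

    # Homogeneity of slopes: residual SS when each group has its own slope.
    ss_sep = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in per_group)
    f_slopes = ((ss_error - ss_sep) / (k - 1)) / (ss_sep / (n - 2 * k))

    def row(ss, df):
        return {"SS": ss, "df": df, "MS": ss / df,
                "F": (ss / df) / ms_error,
                "partial_eta2": ss / (ss + ss_error)}

    return {"covariate": row(ss_cov, 1),
            "group": row(ss_group, k - 1),
            "error": {"SS": ss_error, "df": df_error, "MS": ms_error},
            "slopes": {"F": f_slopes, "df": (k - 1, n - 2 * k)}}


# Hypothetical pretest/posttest scores for two conditions.
data = {"immersive": ([1, 2, 3, 4], [2, 4, 5, 7]),
        "desktop": ([1, 2, 3, 4], [4, 5, 8, 9])}
for source, stats in ancova(data).items():
    print(source, stats)
```

Residual normality and homoscedasticity would still need to be examined separately, for example with residual plots or formal tests in a statistics package.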

[Critical] 7) Cognitive load results: numerical inconsistency and lack of effect magnitude

  • Finding: The text reports means 4.24 vs 4.34, whereas Table 2 reports “Global” 4.27 vs 4.36; effect size is not reported despite a small mean difference.
  • Location: Results text (p. 9, ~lines 254–261) and Table 2 (p. 6).
  • Impact: Traceability of values is compromised; practical significance cannot be judged without effect size/CI.
  • Improvement: Align values (clarify adjusted vs raw means) and report effect sizes (partial η² or d) with CI95%.

[Critical] 8) Causal language is used for correlational evidence

  • Finding: The manuscript states “when cognitive load increases, attitude decreases” based on correlations.
  • Location: p. 10, ~lines 277–280 (text after Table 6).
  • Impact: This exceeds what correlational analysis can support (association ≠ causation).
  • Improvement: Rephrase as “a negative association was observed…” and state limitations/possible omitted variables.

Major comments

[Major] 9) Objective–measure alignment: insufficient operational definitions for key constructs

  • Finding: Objectives cover knowledge, cognitive load, and attitudes, but operational definitions and scoring rules are not fully explicit.
  • Location: 2.1 Objectives (p. 4, ~lines 140–151) and Tables 2–3.
  • Impact: Weakens construct traceability and the logic linking objectives to evidence.
  • Improvement: Provide explicit operational definitions and scoring rules for each variable.

[Major] 10) Discussion attributes mechanisms (hotspots, guides) that are not documented in Methods

  • Finding: Low cognitive load is attributed to specific design elements (guides, hotspots, spatial organization) not described in Methods.
  • Location: Discussion (p. 11, ~lines 319–323).
  • Impact: The explanation may be plausible but is not supported by documented intervention details.
  • Improvement: Either document these elements in Methods or frame them explicitly as interpretive hypotheses.

[Major] 11) t-test reporting for attitudes is incomplete

  • Finding: Table 7 provides means/SD and a t value but omits df, p, and effect size.
  • Location: p. 10, Table 7 and surrounding text.
  • Impact: Limits interpretation of “no difference” and prevents assessment of power/magnitude.
  • Improvement: Report t(df), p, Cohen’s d, and CI95%.
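The quantities requested here can be derived from group summary statistics alone, as the sketch below suggests. It assumes equal variances (pooled SD) and uses a large-sample normal approximation for the confidence interval of Cohen's d. The means and SDs in the example are hypothetical; only the group sizes (70 and 66) are taken from the review. The p value is omitted because it requires a t distribution from a statistics package.

```python
from math import sqrt
from statistics import NormalDist


def two_group_summary(m1, s1, n1, m2, s2, n2, level=0.95):
    """Student t, df, Cohen's d, and an approximate CI for d from summary stats."""
    df = n1 + n2 - 2
    s_pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    d = (m1 - m2) / s_pooled
    t = d / sqrt(1 / n1 + 1 / n2)        # identical to the pooled-variance t
    # Large-sample standard error of d (an approximation, not an exact interval).
    se_d = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return {"t": t, "df": df, "d": d, "ci": (d - z * se_d, d + z * se_d)}


# Hypothetical means and SDs; group sizes as reported (70 immersive, 66 desktop).
result = two_group_summary(5.0, 1.0, 70, 4.5, 1.0, 66)
print({key: result[key] for key in ("t", "df", "d")}, result["ci"])
```

Reporting the interval alongside d makes a "no difference" conclusion easier to judge, since a wide interval signals that meaningful differences cannot be excluded.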

[Major] 12) Data handling and quality control are not described

  • Finding: No statement on missing data, exclusions, or cleaning rules.
  • Location: Methods (Section 2 overall).
  • Impact: Reduces transparency and can bias results if unreported exclusions occurred.
  • Improvement: Add explicit data QC procedures and missing-data handling rules.

Minor comments

[Minor] 13) Word-cloud figure needs a clear construction method

  • Finding: Word clouds are shown but the processing pipeline is not described.
  • Location: Figure 2 (p. 8).
  • Impact: Currently illustrative rather than analytical.
  • Improvement: Describe preprocessing (tokenization, stopwords, weighting, thresholds) and what each cloud represents.
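A processing pipeline of the kind this comment asks the authors to describe might look like the sketch below. It is purely illustrative: the stopword list, the minimum-count threshold, and the relative-frequency weighting are assumptions chosen for the example, and the responses are invented rather than taken from the study.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real analysis would document the list used.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "it", "in", "very"}


def word_weights(responses, min_count=2):
    """Tokenize, drop stopwords, count, threshold, and scale weights to (0, 1]."""
    tokens = []
    for text in responses:
        # Runs of letters only (accented characters included), lowercased.
        words = re.findall(r"[^\W\d_]+", text.lower())
        tokens += [w for w in words if w not in STOPWORDS]
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    top = max(kept.values(), default=1)
    return {w: c / top for w, c in kept.items()}


# Invented open-ended responses.
responses = ["The experience was immersive and fun",
             "Fun and very immersive",
             "It was fun"]
print(word_weights(responses))
```

Stating each of these choices (tokenizer, stopwords, threshold, weighting) and which participant group each cloud draws on is what would let a reader treat the figure as analytical.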

Concrete strengths

S1) Comparative design with pretest–posttest and covariate use

  • Location: Section 2.3 and ANCOVA section (p. 9).
  • Why it strengthens the work: Using pretest as a covariate can reduce baseline variability if assumptions are met.
  • Action: Keep ANCOVA but report it fully and verify assumptions.

S2) Reporting both Cronbach’s α and McDonald’s Ω

  • Location: Table 1 (p. 5, ~lines 199–204).
  • Why it strengthens the work: Improves psychometric transparency.
  • Action: Complement with explicit scoring definitions and corrected tables.

 

The topic and overall design are promising, but the current version has critical reproducibility/reporting issues that must be resolved.

Prioritized action list

  1. Expand intervention/procedure details (tasks, timing, standardization).
  2. Rebuild Tables 3 and 4 (and correct Table 2) into auditable formats.
  3. Complete statistical reporting (ANCOVA assumptions, effect sizes, CIs; full t-test).
  4. Ensure numerical consistency and clarify raw vs adjusted means.
  5. Strengthen the performance test evidence (reliability/item analysis; practice-effect control).
  6. Remove causal wording for correlational findings.

 

Comments on the Quality of English Language

The manuscript is readable, but the English requires improvement to ensure conceptual precision, terminological consistency, and editorial credibility. My assessment distinguishes between (i) template-related artifacts (expected in a “FOR PEER REVIEW” submission) and (ii) language/style issues that directly affect scientific clarity.

  1. Template and editing artifacts that must be removed in the revised version. The manuscript contains non-scientific fragments embedded in results/tables (e.g., the “Good luck” label in the ANCOVA table).
  2. Terminological consistency and semantic precision. The manuscript alternates and overlaps terms such as “VR,” “immersive/non-immersive,” and “XR” without a stable operational definition. This is not merely stylistic: in a comparative study, each condition must be labeled consistently to avoid interpretive ambiguity. A single terminological convention should be established (primary acronym + first-use definition) and maintained throughout.
  3. Scientific phrasing and inferential alignment. Some passages use wording that implies directionality/causality (e.g., “when X increases, Y decreases”) while the evidence is correlational; here the issue is partly linguistic, and the phrasing should be aligned with the analytic scope (association vs causation) to avoid over-interpretation.
  4. Table labels and formal clarity. Beyond content issues, several table headings/labels show linguistic inconsistencies that hinder readability (e.g., dimension descriptors in NASA-TLX and repeated headers in the semantic differential table). These require technical language editing (not only grammar) to ensure readers can unambiguously identify variables, scales, and experimental conditions.

Perform an academic copyediting pass focused on: (i) removing template/editing artifacts; (ii) normalizing terminology with operational definitions at first mention; (iii) revising potentially causal phrasing into design-appropriate language; and (iv) correcting table headings/labels to ensure consistency between text, tables, and measures.

Author Response

Please see the attached file for your reference.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for the careful and comprehensive revision of the manuscript. The responses provided demonstrate that the authors have taken the reviewer’s comments seriously and have addressed them effectively. The abstract has been appropriately revised with more cautious wording, and the overall language quality of the manuscript has been substantially improved. Structural modifications—such as reorganizing the former subsections of the Introduction into a dedicated Theoretical Framework section, clarifying the study aim at the end of the Introduction, and restructuring the Discussion and Conclusions sections—have significantly enhanced the coherence and readability of the paper. In addition, the improvements made to the methodology section, particularly the clearer presentation of the participants and instruments, contribute positively to the transparency of the research design. The adjustments to the tables and figures, as well as the removal of redundant elements, have also improved the overall presentation of the results. Overall, the revised manuscript successfully addresses the previously raised concerns and is now considerably clearer and better structured.

Comments on the Quality of English Language

The quality of the English language has been significantly improved in the revised manuscript. The authors have carefully addressed the previously identified issues related to grammar, punctuation, sentence structure, and overall clarity. The revisions have improved the flow between paragraphs, reduced awkward phrasing, and strengthened the academic tone of the text. As a result, the manuscript is now much clearer and more readable. Further minor language polishing before final publication could still be beneficial, but the current version demonstrates a clear and successful effort to address the language concerns raised in the previous review.

Author Response

Thank you very much for your positive evaluation and for the constructive feedback provided throughout the review process. We sincerely appreciate the time and effort you have dedicated to assessing our manuscript.

We are pleased that the revisions have satisfactorily addressed the concerns raised in the previous review. The manuscript has been further refined and improved in its final version following your suggestions.

Thank you again for your valuable comments and for helping us strengthen the quality and clarity of our work.

Reviewer 5 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for the revised version of the manuscript and for the detailed response to the previous review round. This second version shows visible improvements compared with the earlier submission, particularly in the restructuring of several tables, the clearer operational definition of some variables, and the correction of inconsistencies that previously limited the basic traceability of the results. I also appreciate that a number of methodological limitations are now stated more explicitly than before.

That said, the revised manuscript still contains weaknesses that affect procedural reproducibility, measurement validity, and the inferential strength of several central claims. My overall assessment is therefore that the paper has improved, but it still requires substantial revision before it can be considered scientifically robust for acceptance.

Critical evaluation of the manuscript

1) Critical (must-fix)

Finding: The inferential logic regarding “knowledge acquisition” remains stronger than the evidence actually presented.
Location: Abstract; Section 4.3; Discussion; Conclusions.
Impact: The manuscript states that interaction with the VR learning object “produced significant knowledge acquisition” and that there were “significant knowledge gains in both conditions,” yet the reported analysis in Table 6 is an ANCOVA with posttest as the dependent variable and pretest as the covariate. Such a model estimates the effect of pretest and condition on adjusted posttest scores, but it does not by itself constitute a direct demonstration of within-subject gain unless full pre/post descriptive statistics by group or an explicit change analysis are provided. In its current form, the wording goes beyond what the reported analysis clearly supports.
Improvement: Rephrase these statements so that they remain strictly aligned with the analysis reported. If you wish to sustain claims about “learning gains,” you should provide at least the pretest and posttest means and standard deviations by condition and either report an explicit change analysis or justify methodologically why ANCOVA is interpreted here as sufficient evidence of improvement.
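An explicit change analysis of the kind mentioned here can be sketched as a paired comparison of pretest and posttest scores within one condition. The scores below are hypothetical and the function name is illustrative; the p value is omitted because it requires a t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def gain_summary(pre, post):
    """Paired pre/post change: mean gain, SD of gains, paired t, df, and d_z."""
    gains = [after - before for before, after in zip(pre, post)]
    n = len(gains)
    m, s = mean(gains), stdev(gains)     # sample SD of the individual gains
    return {"mean_gain": m, "sd_gain": s,
            "t": m / (s / sqrt(n)), "df": n - 1,
            "d_z": m / s}                # standardized mean change


# Hypothetical pretest and posttest scores for four participants in one group.
pre = [10, 12, 11, 13]
post = [14, 15, 13, 18]
print(gain_summary(pre, post))
```

Running this separately for each condition, next to the pretest and posttest means and SDs, would document within-group change directly. It would not by itself separate learning from practice effects when the same items are used at both time points.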

2) Critical (must-fix)

Finding: The description of the experimental procedure has improved, but it still does not reach a fully reproducible level.
Location: Section 3.3.
Impact: The manuscript now distinguishes three phases—pretest, intervention, and posttest—which improves methodological readability. However, essential elements are still missing for precise reconstruction of what participants actually did: exact session length, effective exposure time, task sequence, instructional script, completion criteria, allowed assistance, classroom organization, number of students per session, and handling of technical incidents or interruptions. Without these details, internal validity and reproducibility remain limited.
Improvement: Add a concise but precise operational protocol within Section 3.3, specifying total duration and phase duration, instructions provided, permitted/prohibited assistance, environmental control, criteria for ending the activity, and incident handling.

3) Critical (must-fix)

Finding: The performance test remains the weakest methodological component of the manuscript.
Location: Section 3.4.1; Table 1; Section 5.1.
Impact: Including the full set of 25 items improves content auditability, but it does not resolve the central problem: the instrument still lacks minimal psychometric evidence to support the performance construct with sufficient rigor. In addition, the same items were used in pretest and posttest, with only the item order altered. This design choice leaves open the possibility of practice or recall effects, as the manuscript itself acknowledges in Limitations. Because performance is a central outcome variable, this weakness directly affects the validity of the conclusions.
Improvement: If available, report basic test quality indicators (e.g., item difficulty, discrimination, internal consistency, or at least a content-validity rationale). If such data are not available, moderate the strength of the learning-related claims and state this limitation explicitly in the instrument subsection, not only in Limitations.

4) Critical (must-fix)

Finding: The semantic differential scale has improved formally, but it still shows conceptual and tabular consistency problems.
Location: Section 3.4.3; Table 2; Table 5.
Impact: The manuscript states that the scale contains 26 bipolar adjective pairs, yet the visible Table 2 lists 24 rows and Table 5 presents 26 analytical items with clear repetitions (“Pleasant–Unpleasant,” “Ineffective–Effective,” “Complicated–Simple,” “Valuable–Worthless”) and item pairs that do not exactly match the definition table (“Educational–Pernicious” versus “Harmful–Educational”; “Wonderful–Horrific” versus “Horrible–Wonderful”). Although the reported internal consistency is high, this inconsistency compromises instrument auditability and the interpretability of the total attitude score.
Improvement: Clean and reconcile the scale documentation and present one definitive master list containing exactly the 26 adjective pairs actually used in the analysis, in the same order and orientation as in the results table. If there was linguistic adaptation or item merging/removal, this should be explained.

5) Major (should-fix)

Finding: Sample characterization remains insufficient for assessing baseline comparability and external scope.
Location: Section 3.2; Section 5.1.
Impact: The revised manuscript now includes sex distribution in addition to sample size and random assignment, which is an improvement. However, the manuscript itself acknowledges that age, prior VR experience, and cybersickness susceptibility were not systematically collected. In a study comparing immersive and non-immersive modalities, these are not secondary variables; they may plausibly moderate experience, cognitive load, and attitudinal evaluation. Their absence limits both group comparability and interpretation.
Improvement: Include a baseline table with all available participant characteristics by group. If key variables were not measured, state this directly in Methods and not only in Limitations.

6) Major (should-fix)

Finding: ANCOVA reporting is improved, but the analysis remains incomplete because formal assumption checks are not reported.
Location: Section 4.1; Section 4.3; Section 5.1; Table 6.
Impact: The revised version improves the ANCOVA table by reporting estimated SS, df, estimated MS, F, p, and partial η². This is a meaningful improvement. However, the manuscript also acknowledges that formal checks of homogeneity of regression slopes, residual normality, and homoscedasticity were not reported. This omission is not trivial, because several key conclusions rely on this model.
Improvement: If these checks were performed, report them explicitly. If they were not performed, moderate the inferential strength of the ANCOVA-based interpretations and state clearly that those results should be interpreted with caution.

7) Major (should-fix)

Finding: The interpretation of cognitive load is reasonable, but there is still a gap between the observed result and the explanatory account offered.
Location: Section 4.1; Discussion.
Impact: The correction of the NASA-TLX description and the clarification that the Raw NASA-TLX variant was used represent a substantial methodological improvement. However, the Discussion still attributes the cognitive-load pattern to specific design elements of the learning object—such as viewing guides, informative hotspots, and spatial organization—without those elements being operationally described in sufficient detail in Methods or experimentally manipulated as factors in the design. You do introduce an inferential caution, which is positive, but the explanatory passage still leans on mechanisms not directly documented within the study.
Improvement: Reduce the explanatory strength of this passage by one further level and present those design elements as plausible interpretive hypotheses rather than study-established factors.

8) Major (should-fix)

Finding: The correspondence between text, tables, and conclusions has improved, but some presentation choices still reduce analytical precision.
Location: Table 4; Table 5; Table 6; Conclusions.
Impact: Table 4 reports the global cognitive-load score without a standard deviation, while the text places substantial interpretive weight on the global score. Table 5 reports item-level means, but the narrative interpretation moves quickly toward the total attitude score without explaining why the total score is preferable to the semantic pattern across items. In the Conclusions, the overall direction of findings is summarized adequately, but several statements still sound more definitive than warranted by the design limitations and by the unresolved measurement issues surrounding the performance test.
Improvement: Add a brief rationale for why the interpretation prioritizes global indices and revise the Conclusions so that they remain strictly proportional to the available evidence.

9) Minor (nice-to-have)

Finding: Figure 2 continues to have mainly illustrative rather than analytical value.
Location: Figure 2 and the introductory text in Section 4.2.
Impact: The revised caption improves transparency somewhat by noting stopword removal and frequency-based weighting. However, the figure still does not contribute strong analytical evidence, and its construction through a generative tool is described too generally for reproducible evaluation. In its current form, it functions more as a complementary visualization than as a substantive scientific result.
Improvement: Either explicitly frame the figure as illustrative, or describe the processing pipeline more precisely. If it does not contribute to the central argument, removal would also be reasonable.

10) Minor (nice-to-have)

Finding: Editorial template elements are still present in the revised manuscript.
Location: Front matter; final section of the manuscript; provisional DOI; “Author Contributions”; reference-format instruction text.
Impact: These elements do not directly affect the results, but they weaken the formal presentation of the paper and convey the impression of incomplete editorial preparation. In a second review round, this type of oversight reduces confidence in the manuscript’s overall readiness.
Improvement: Remove all placeholders and template text before resubmission.

Real strengths of the manuscript

11) Major (should-fix, but as a strength to consolidate)

Finding: The revision resolves an important transparency problem in the measurement of cognitive load.
Location: Section 3.4.2; Table 4.
Impact: The explicit definition of the six NASA-TLX dimensions and the clarification that the Raw NASA-TLX variant was used substantially improve construct interpretability. This correction addresses a meaningful weakness of the previous version and strengthens the traceability of the cognitive-load findings.
Improvement: Retain this improvement and, to consolidate it, add a short justification for prioritizing the global score over inferential testing of the individual dimensions.

12) Major (should-fix, but as a strength to consolidate)

Finding: The manuscript now acknowledges major methodological limitations with greater analytical honesty.
Location: Section 5.1.
Impact: Explicitly stating the absence of baseline variables, the repeated-item pre/post design, the lack of item-level psychometric analysis, and the lack of reported ANCOVA assumption checks improves the scientific transparency of the paper. This does not by itself solve those weaknesses, but it does improve the honesty and interpretability of the report.
Improvement: Carry some of this caution into Results and Conclusions so that interpretive restraint is not confined only to the Limitations subsection.

The manuscript is improved relative to the previous version, especially in the presentation of instruments, the partial correction of numerical/reporting inconsistencies, and the clearer acknowledgment of limitations. However, important methodological and inferential weaknesses remain: the evidence for performance still rests on an insufficiently supported instrument, the experimental procedure is not yet described at a fully reproducible level, the attitudinal scale documentation still requires reconciliation, and several claims about learning exceed what the reported analysis demonstrates unequivocally.

Prioritized action list

  1. Rephrase the claims about “knowledge acquisition” so that they do not exceed the reported evidence, or add explicit descriptive/analytical evidence of pre/post change.

  2. Complete the procedural description with enough operational detail for reproducibility.

  3. Strengthen the validity argument for the performance test, or clearly moderate the conclusions tied to that outcome.

  4. Resolve the inconsistency between the definition and the reporting of the semantic differential scale.

  5. Report ANCOVA assumption checks or explicitly condition the interpretation on their absence.

  6. Revise the Discussion and Conclusions so that every explanatory claim remains strictly within the scope of the design and evidence.

Comments on the Quality of English Language

The quality of the English has improved in this second revision, and the manuscript is generally understandable. However, it still requires additional language editing before publication.

The main issue is no longer overall intelligibility, but scientific precision in the writing. In several parts of the manuscript, especially in Results, Discussion, and Conclusions, some statements sound more categorical than the design and data support. In addition, terminological inconsistencies remain across expressions such as desktop, non-immersive, and desktop VR, which weakens textual cohesion. There are also overly long sentences and passages that still read like literal translation, particularly in the discussion.

I recommend a final English revision focused on the following points:
(1) standardizing terminology throughout the manuscript;
(2) calibrating the strength of claims so that it matches the evidence; and
(3) splitting long sentences to improve clarity and precision.

In its current form, the English can still be substantially improved to communicate the scientific content of the study more clearly and rigorously.

Author Response

Response to Reviewer Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below; the corresponding revisions and corrections are highlighted in track changes in the re-submitted files.

2. Point-by-point response to Comments and Suggestions for Authors

Comments 1: 1) Critical (must-fix)

Finding: The inferential logic regarding “knowledge acquisition” remains stronger than the evidence actually presented.

Location: Abstract; Section 4.3; Discussion; Conclusions.

Impact: The manuscript states that interaction with the VR learning object “produced significant knowledge acquisition” and that there were “significant knowledge gains in both conditions,” yet the reported analysis in Table 6 is an ANCOVA with posttest as the dependent variable and pretest as the covariate. Such a model estimates the effect of pretest and condition on adjusted posttest scores, but it does not by itself constitute a direct demonstration of within-subject gain unless full pre/post descriptive statistics by group or an explicit change analysis are provided. In its current form, the wording goes beyond what the reported analysis clearly supports.

Improvement: Rephrase these statements so that they remain strictly aligned with the analysis reported. If you wish to sustain claims about “learning gains,” you should provide at least the pretest and posttest means and standard deviations by condition and either report an explicit change analysis or justify methodologically why ANCOVA is interpreted here as sufficient evidence of improvement.

Response 1: We thank the reviewer for this important observation. We fully agree that an ANCOVA with posttest as the dependent variable and pretest as the covariate does not, by itself, directly demonstrate within-subject learning gains. To address this, we have taken two actions.

First, a new Table 7 (Pretest and Posttest Descriptive Statistics by Condition) has been added in Section 4.3, presenting pretest and posttest means and SDs by condition, gain scores, and one-sample t-tests on those gains (H₀: gain = 0):

— Desktop VR: Mpre = 22.44 (SD = 2.54), Mpost = 23.53 (SD = 1.37), Gain M = 1.09 (SD = 2.50), t(65) = 3.540, p = .001

— Immersive VR: Mpre = 22.14 (SD = 2.29), Mpost = 22.93 (SD = 2.23), Gain M = 0.79 (SD = 2.13), t(69) = 3.083, p = .003

— Total: t(135) = 4.701, p < .001
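
As a quick consistency check, the reported one-sample t statistics can be approximated from the rounded summary values above, since t = M_gain / (SD_gain / sqrt(n)). The following Python sketch is illustrative only: the function name is hypothetical, n is taken as df + 1, and small discrepancies from the reported values are expected because the inputs are rounded to two decimals.

```python
import math


def one_sample_t(mean_gain: float, sd_gain: float, n: int) -> float:
    """t statistic for H0: mean gain = 0, computed from summary statistics."""
    standard_error = sd_gain / math.sqrt(n)
    return mean_gain / standard_error


# Rounded gain-score summaries as quoted in Response 1 (n = df + 1).
t_desktop = one_sample_t(mean_gain=1.09, sd_gain=2.50, n=66)
t_immersive = one_sample_t(mean_gain=0.79, sd_gain=2.13, n=70)

print(f"Desktop VR:   t(65) is approximately {t_desktop:.2f}")
print(f"Immersive VR: t(69) is approximately {t_immersive:.2f}")
```

Both recomputed values should land close to the reported statistics; any larger gap would suggest a transcription problem in the table rather than a rounding effect.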

Second, all claims about "knowledge gains" have been rephrased throughout (Abstract, Section 4.3, Discussion, Conclusions) to remain strictly aligned with the reported analyses. The Conclusions now read: "performance scores increased significantly from pretest to posttest in both conditions [...] with no statistically significant difference between modalities in adjusted posttest scores." The Discussion opens with: "The results suggest that VR is associated with improved posttest performance [...] though the inferential strength of this claim is constrained by the measurement limitations described in Section 5.1." The possibility of practice or recall effects is acknowledged in Sections 3.4.1, 4.3, and 5.1.

 

Comments 2: 2) Critical (must-fix)

Finding: The description of the experimental procedure has improved, but it still does not reach a fully reproducible level.

Location: Section 3.3.

Impact: The manuscript now distinguishes three phases—pretest, intervention, and posttest—which improves methodological readability. However, essential elements are still missing for precise reconstruction of what participants actually did: exact session length, effective exposure time, task sequence, instructional script, completion criteria, allowed assistance, classroom organization, number of students per session, and handling of technical incidents or interruptions. Without these details, internal validity and reproducibility remain limited.

Improvement: Add a concise but precise operational protocol within Section 3.3, specifying total duration and phase duration, instructions provided, permitted/prohibited assistance, environmental control, criteria for ending the activity, and incident handling.

Response 2: We thank the reviewer for this detailed list of missing procedural elements. Section 3.3 has been substantially expanded. The revised text now specifies:

— Duration: "Each session lasted approximately 50 minutes in total: 5 minutes for instructions and setup, 30–35 minutes for interaction with the VR learning object, and 10–15 minutes for completion of the posttest questionnaires."

— Instructions: "the same standardized instructions provided by the instructor."

— Permitted assistance: "No external assistance or additional resources were permitted during the task."

— Technical incidents: "Students in the immersive condition who experienced technical difficulties were assisted by a designated technician; in all cases, the interruption was resolved within the session and participants completed the full protocol."

We acknowledge that certain additional elements — such as a classroom layout diagram, the full instructional script, and the number of students per session — were not systematically documented during data collection and cannot be reported retrospectively. This is noted as a residual reproducibility limitation.

 

Comments 3: 3) Critical (must-fix)

Finding: The performance test remains the weakest methodological component of the manuscript.

Location: Section 3.4.1; Table 1; Section 5.1.

Impact: Including the full set of 25 items improves content auditability, but it does not resolve the central problem: the instrument still lacks minimal psychometric evidence to support the performance construct with sufficient rigor. In addition, the same items were used in pretest and posttest, with only the item order altered. This design choice leaves open the possibility of practice or recall effects, as the manuscript itself acknowledges in Limitations. Because performance is a central outcome variable, this weakness directly affects the validity of the conclusions.

Improvement: If available, report basic test quality indicators (e.g., item difficulty, discrimination, internal consistency, or at least a content-validity rationale). If such data are not available, moderate the strength of the learning-related claims and state this limitation explicitly in the instrument subsection, not only in Limitations.

Response 3: We acknowledge this as the most substantive methodological limitation of the study. We have addressed it at two levels.

First, a validity limitation note has been added directly within Section 3.4.1 (not only in Limitations): "item-level psychometric evidence (difficulty indices, discrimination indices, and internal consistency) is not available for this instrument, and the use of identical items across administrations limits the capacity to rule out practice or memory effects entirely; these constitute limitations of the performance measure, as discussed in Section 5.1."

Second, this limitation is carried into the Results (Section 4.3): "given that the same items were used at pretest and posttest with only item-order randomization, practice or recall effects cannot be fully excluded." It also appears in the closing sentence of the Conclusions: "These conclusions should [...] be read in light of the methodological limitations of the study — particularly the use of identical test items across pretest and posttest, the absence of item-level psychometric validation, and the lack of formal ANCOVA assumption checks — which constrain the inferential strength of the performance-related findings."

Item-level psychometric data were not computed during data collection and are genuinely unavailable for this report. Future replications should include a full item analysis (difficulty, discrimination, KR-20) prior to use.
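
For readers planning such a replication, the item analysis mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: responses are scored dichotomously (1 = correct, 0 = incorrect), the population variance of total scores is used, and the small response matrix is invented for demonstration and is not study data. The function names are hypothetical.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Proportion of examinees answering each item correctly.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]


def kr20(responses):
    """Kuder-Richardson 20 internal consistency for dichotomous items."""
    k = len(responses[0])
    p = item_difficulty(responses)
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of total scores, to match the p*q item variances.
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical 5 examinees x 4 items, for illustration only.
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print("difficulty:", item_difficulty(demo))
print("KR-20:", round(kr20(demo), 3))
```

A discrimination index (for example, the difference in item difficulty between upper and lower scoring groups) could be added in the same style.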

 

Comments 4: 4) Critical (must-fix)

Finding: The semantic differential scale has improved formally, but it still shows conceptual and tabular consistency problems.

Location: Section 3.4.3; Table 2; Table 5.

Impact: The manuscript states that the scale contains 26 bipolar adjective pairs, yet the visible Table 2 lists 24 rows and Table 5 presents 26 analytical items with clear repetitions (“Pleasant–Unpleasant,” “Ineffective–Effective,” “Complicated–Simple,” “Valuable–Worthless”) and item pairs that do not exactly match the definition table (“Educational–Pernicious” versus “Harmful–Educational”; “Wonderful–Horrific” versus “Horrible–Wonderful”). Although the reported internal consistency is high, this inconsistency compromises instrument auditability and the interpretability of the total attitude score.

Improvement: Clean and reconcile the scale documentation and present one definitive master list containing exactly the 26 adjective pairs actually used in the analysis, in the same order and orientation as in the results table. If there was linguistic adaptation or item merging/removal, this should be explained.

Response 4: We thank the reviewer for identifying this inconsistency precisely. We have fully resolved it. The source of the problem was that Table 2 had been constructed with 24 rows — inadvertently omitting two items — and the labels in Table 5 did not match the original instrument exactly.

In the revised manuscript, Table 2 has been completely rebuilt with exactly 26 rows in the order and orientation of the original administered questionnaire: (1) Tedious–Fun; (2) Unpleasant–Fun; (3) Effective–Ineffective; (4) Simple–Complicated; (5) Worthless–Valuable; (6) Difficult–Easy; (7) Impractical–Practical; (8) Negative–Positive; (9) Useless–Useful; (10) Harmful–Educational; (11) Ugly–Beautiful; (12) Inappropriate–Appropriate; (13) Horrible–Wonderful; (14) Trivial–Important; (15) Dispensable–Essential; (16) Harmful–Beneficial; (17) Slow–Fast; (18) Uncomfortable–Comfortable; (19) Boring–Entertaining; (20) Rigid–Flexible; (21) Unnecessary–Necessary; (22) Unpleasant–Agreeable; (23) Ineffective–Effective; (24) Complicated–Simple; (25) Worthless–Valuable; (26) Time-consuming–Time-saving. Table 5 item labels have been corrected to match Table 2 exactly.

We note that the scale contains two structurally repeated pairs (items 5 and 25) and two items sharing a positive pole (items 1 and 2 both end in "Fun"). These features were present in the original administered questionnaire and are documented as such. Negatively oriented items (3, 4) were reversed prior to scoring, as stated in Section 3.4.3.
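
The reversal step described above can be illustrated with a short Python sketch. It assumes a 7-point response format (suggested by the rating thresholds mentioned for Figure 2, but not confirmed here) and treats items 3 and 4 as the reversed items, as stated in the response. The function names and the sample ratings are hypothetical.

```python
SCALE_MIN, SCALE_MAX = 1, 7   # assumed 7-point response format
REVERSED_ITEMS = {3, 4}       # item numbers described as reversed before scoring


def reverse_score(rating: int) -> int:
    """Mirror a rating around the scale midpoint (1 <-> 7, 2 <-> 6, ...)."""
    return SCALE_MIN + SCALE_MAX - rating


def attitude_score(ratings: dict[int, int]) -> float:
    """Mean rating after aligning every item so that higher = more positive."""
    aligned = [
        reverse_score(rating) if item in REVERSED_ITEMS else rating
        for item, rating in ratings.items()
    ]
    return sum(aligned) / len(aligned)


# Hypothetical ratings for the first five items of one respondent.
example = {1: 6, 2: 7, 3: 2, 4: 1, 5: 6}
print(attitude_score(example))
```

Applying the reversal before computing internal consistency matters, because unreversed items would correlate negatively with the rest of the scale.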

 

Comments 5: 5) Major (should-fix)

Finding: Sample characterization remains insufficient for assessing baseline comparability and external scope.

Location: Section 3.2; Section 5.1.

Impact: The revised manuscript now includes sex distribution in addition to sample size and random assignment, which is an improvement. However, the manuscript itself acknowledges that age, prior VR experience, and cybersickness susceptibility were not systematically collected. In a study comparing immersive and non-immersive modalities, these are not secondary variables; they may plausibly moderate experience, cognitive load, and attitudinal evaluation. Their absence limits both group comparability and interpretation.

Improvement: Include a baseline table with all available participant characteristics by group. If key variables were not measured, state this directly in Methods and not only in Limitations.

Response 5: As stated in Section 3.2, data on age, prior VR experience, and cybersickness susceptibility were not systematically collected prior to the study. The revised text now states this limitation directly in the Methods: "a baseline characteristics table by condition is therefore not available. This constitutes a limitation that restricts the assessment of group comparability beyond sex distribution and random assignment, as discussed further in Section 5.1."

The available participant-level information — sex, academic programme, and condition (random assignment) — is fully reported. Given that random assignment was employed and sex was not used as a covariate, the core threat to internal validity is considered modest. We fully acknowledge, however, that external validity and the assessment of moderating variables are constrained by the absence of the missing baseline data.

 

Comments 6: 6) Major (should-fix)

Finding: ANCOVA reporting is improved, but the analysis remains incomplete because formal assumption checks are not reported.

Location: Section 4.1; Section 4.3; Section 5.1; Table 6.

Impact: The revised version improves the ANCOVA table by reporting estimated SS, df, estimated MS, F, p, and partial η². This is a meaningful improvement. However, the manuscript also acknowledges that formal checks of homogeneity of regression slopes, residual normality, and homoscedasticity were not reported. This omission is not trivial, because several key conclusions rely on this model.

Improvement: If these checks were performed, report them explicitly. If they were not performed, moderate the inferential strength of the ANCOVA-based interpretations and state clearly that those results should be interpreted with caution.

Response 6: The assumption checks were not performed prior to data collection and cannot be reported retrospectively. An explicit paragraph has been added to the Discussion: "formal verification of ANCOVA assumptions (homogeneity of regression slopes, residual normality, and homoscedasticity) was not conducted in this study. Accordingly, the ANCOVA-based results should be interpreted with appropriate caution, and their replication with full assumption reporting is recommended."

This caveat is now present at three levels: within the Discussion, within the Conclusions ("the lack of formal ANCOVA assumption checks — which constrain the inferential strength of the performance-related findings"), and within Section 5.1 (Limitations).
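
For a replication with full assumption reporting, the homogeneity-of-regression-slopes check can be sketched as a comparison between a model with a common pretest slope (the ANCOVA model) and a model with a separate slope per condition. The Python example below is a minimal sketch under stated assumptions: one covariate, the function names are hypothetical, and the data are invented for illustration and are not study data.

```python
def _centered_sums(x, y):
    """Return (Sxx, Sxy, Syy) about the group means."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxx, sxy, syy


def slopes_homogeneity_f(groups):
    """F test of equal pretest slopes across conditions.

    groups: list of (pretest, posttest) tuples of equal-length lists.
    Returns (F, df1, df2).
    """
    parts = [_centered_sums(x, y) for x, y in groups]
    n_total = sum(len(x) for x, _ in groups)
    g = len(groups)
    # Full model: separate intercept and slope for each condition.
    sse_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in parts)
    # Reduced (ANCOVA) model: separate intercepts, one pooled slope.
    pooled_sxx = sum(p[0] for p in parts)
    pooled_sxy = sum(p[1] for p in parts)
    sse_reduced = sum(p[2] for p in parts) - pooled_sxy ** 2 / pooled_sxx
    df1, df2 = g - 1, n_total - 2 * g
    f_value = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_value, df1, df2


# Hypothetical pretest/posttest scores for two conditions.
condition_a = ([1, 2, 3, 4], [2, 3, 5, 6])
condition_b = ([1, 2, 3, 4], [6, 5, 5, 3])
f_value, df1, df2 = slopes_homogeneity_f([condition_a, condition_b])
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

A large F relative to the F(df1, df2) distribution indicates that the slopes differ and that the common-slope ANCOVA is questionable; a p-value could be obtained from an F distribution routine such as `scipy.stats.f.sf`. Residual normality and homoscedasticity would need separate checks on the model residuals.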

 

Comments 7: 7) Major (should-fix)

Finding: The interpretation of cognitive load is reasonable, but there is still a gap between the observed result and the explanatory account offered.

Location: Section 4.1; Discussion.

Impact: The correction of the NASA-TLX description and the clarification that the Raw NASA-TLX variant was used represent a substantial methodological improvement. However, the Discussion still attributes the cognitive-load pattern to specific design elements of the learning object—such as viewing guides, informative hotspots, and spatial organization—without those elements being operationally described in sufficient detail in Methods or experimentally manipulated as factors in the design. You do introduce an inferential caution, which is positive, but the explanatory passage still leans on mechanisms not directly documented within the study.

Improvement: Reduce the explanatory strength of this passage by one further level and present those design elements as plausible interpretive hypotheses rather than study-established factors.

Response 7: The explanatory passage in the Discussion has been revised to reduce its assertiveness by one further level. The revised text now reads: "a result that may plausibly be related to design features of the VR learning object — such as viewing guides, informative hotspots, and clear spatial organization — although these elements were not experimentally manipulated and their individual contributions cannot be established from the present data." These elements are thus presented as plausible interpretive hypotheses rather than study-established factors.

 

Comments 8: 8) Major (should-fix)

Finding: The correspondence between text, tables, and conclusions has improved, but some presentation choices still reduce analytical precision.

Location: Table 4; Table 5; Table 6; Conclusions.

Impact: Table 4 reports the global cognitive-load score without a standard deviation, while the text places substantial interpretive weight on the global score. Table 5 reports item-level means, but the narrative interpretation moves quickly toward the total attitude score without explaining why the total score is preferable to the semantic pattern across items. In the Conclusions, the overall direction of findings is summarized adequately, but several statements still sound more definitive than warranted by the design limitations and by the unresolved measurement issues surrounding the performance test.

Improvement: Add a brief rationale for why the interpretation prioritizes global indices and revise the Conclusions so that they remain strictly proportional to the available evidence.

Response 8: We address the three sub-points separately.

(a) SD of the global score: A note has been added below Table 4: "The standard deviation for the global (Raw TLX) score was 1.44 for the total sample (Desktop VR SD = 1.56; Immersive VR SD = 1.34)."

(b) Justification for the global score: Section 3.4.2 now includes an explicit rationale: the Raw NASA-TLX global score is used as the primary index because it provides a multidimensional summary appropriate for between-group comparisons, consistent with its use in the validation literature [47–49]. Dimension-level means are retained in Table 4 to allow inspection of the subscale profile.

(c) Moderation of conclusions: The final paragraph of Section 6 now closes with: "These conclusions should nonetheless be read in light of the methodological limitations of the study — particularly the use of identical test items across pretest and posttest, the absence of item-level psychometric validation, and the lack of formal ANCOVA assumption checks — which constrain the inferential strength of the performance-related findings."

 

Comments 9: 9) Minor (nice-to-have)

Finding: Figure 2 continues to have mainly illustrative rather than analytical value.

Location: Figure 2 and the introductory text in Section 4.2.

Impact: The revised caption improves transparency somewhat by noting stopword removal and frequency-based weighting. However, the figure still does not contribute strong analytical evidence, and its construction through a generative tool is described too generally for reproducible evaluation. In its current form, it functions more as a complementary visualization than as a substantive scientific result.

Improvement: Either explicitly frame the figure as illustrative or describe the processing pipeline more precisely. If it does not contribute to the central argument, removal would also be reasonable.

Response 9: The introductory text for Figure 2 now states explicitly: "These visualizations are intended as an illustrative complement to the numerical data in Table 5 and do not constitute an independent analytical result. The word clouds were generated by assigning frequency-based weights to adjective items after removing stopwords; items rated 6 or 7 were classified as high-scoring and items rated 1 or 2 as low-scoring."

 

Comments 10: 10) Minor (nice-to-have)

Finding: Editorial template elements are still present in the revised manuscript.

Location: Front matter; final section of the manuscript; provisional DOI; “Author Contributions”; reference-format instruction text.

Impact: These elements do not directly affect the results, but they weaken the formal presentation of the paper and convey the impression of incomplete editorial preparation. In a second review round, this type of oversight reduces confidence in the manuscript’s overall readiness.

Improvement: Remove all placeholders and template text before resubmission.

Real strengths of the manuscript.

Response 10: All template placeholders have been removed: the 'Academic Editor' field has been updated, the Author Contributions section now contains the actual contributions of each author, and the reference-format instruction text has been deleted.

 

Comments 11: 11) Major (should-fix, but as a strength to consolidate)

Finding: The revision resolves an important transparency problem in the measurement of cognitive load.

Location: Section 3.4.2; Table 4.

Impact: The explicit definition of the six NASA-TLX dimensions and the clarification that the Raw NASA-TLX variant was used substantially improve construct interpretability. This correction addresses a meaningful weakness of the previous version and strengthens the traceability of the cognitive-load findings.

Improvement: Retain this improvement and, to consolidate it, add a short justification for prioritizing the global score over inferential testing of the individual dimensions.

Response 11: We thank the reviewer for recognizing this improvement. A brief justification has been added to Section 3.4.2: the Raw NASA-TLX global score is used as the primary index because it provides a multidimensional summary appropriate for between-group comparisons, consistent with the validation literature [47–49]. Inferential tests on individual dimensions were not conducted, as this would require correction for multiple comparisons and goes beyond the study's primary objectives.

 

Comments 12: 12) Major (should-fix, but as a strength to consolidate)

Finding: The manuscript now acknowledges major methodological limitations with greater analytical honesty.

Location: Section 5.1.

Impact: Explicitly stating the absence of baseline variables, the repeated-item pre/post design, the lack of item-level psychometric analysis, and the lack of reported ANCOVA assumption checks improves the scientific transparency of the paper. This does not by itself solve those weaknesses, but it does improve the honesty and interpretability of the report.

Improvement: Carry some of this caution into Results and Conclusions so that interpretive restraint is not confined only to the Limitations subsection.

Response 12: Following this suggestion, interpretive restraint has been distributed across the manuscript: (a) Section 3.4.1 contains an explicit limitation note on the instrument; (b) Section 4.3 closes with a caveat on practice/recall effects; (c) the Discussion opens with a sentence conditioning the performance claim on the measurement limitations; (d) the Conclusions close with a paragraph linking findings to methodological constraints. Caution is thus visible at every level, not confined to Section 5.1.

3. Response to Comments on the Quality of English Language

Point 1:

Response 1: We have addressed the points identified: (1) terminology has been standardized: 'immersive VR' and 'desktop VR' are used consistently throughout, with 'XR' appearing only in citations to XR-validated instruments; (2) claim strength has been calibrated as detailed in the responses to Comments 1, 3, 6, and 8; (3) overly long sentences in the Discussion and Conclusions have been split or restructured, and passages reading as literal translation have been rewritten. We intend to engage a professional native-English editing service prior to final resubmission.
