Augmented Reality’s Impact on English Vocabulary and Content Acquisition in the CLIL Classroom
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Please see the uploaded file.
Comments for author File: Comments.pdf
Author Response
Comment 1:
"Although the topic is highly relevant, the manuscript should more clearly emphasize the study's original contribution. Please explicitly highlight how this research advances existing AR–CLIL literature, such as its instructional design, sample demographics, technological implementation, or methodological approach."
Response 1:
Thank you for this comment. We agree that explicitly stating the originality of our study is important for clarity. Therefore, we have added one concise sentence at the very end of the Abstract (page 1, final sentence) and another at the end of the Introduction (page 4, final paragraph). These additions highlight the study’s unique contribution without making the manuscript unnecessarily long.
Revised text in the Abstract (addition in bold):
“…under which instructional circumstances, AR yields the greatest and most sustainable gains. This study’s novelty lies in its direct AR-versus-print comparison in a real CLIL classroom using markerless, smartphone-based technology.”
Revised text in the Introduction (addition in bold):
“…while also supporting knowledge retention [16–21]. This study adds to AR–CLIL research by testing a markerless, smartphone-based design under real classroom conditions and directly comparing AR and print-based delivery using identical content and tasks.”
Comment 2:
"The intervention was limited to a single 30-minute session, which restricts the ability to evaluate sustained learning outcomes. The limitations of this brief exposure should be clearly addressed, accompanied by recommendations for future studies to adopt longitudinal or multi-session designs."
Response 2:
Thank you for this observation. We agree that the brevity of the intervention is a limitation. To address this, we have added a single sentence in Discussion – Section 4.4 Methodological considerations (page 15, paragraph 1) to acknowledge the short duration and recommend longer, multi-session studies in the future.
Sentence:
“It should be noted that the study involved a single 30-minute session and that the 30-day follow-up measured only attitudes and usability rather than knowledge retention; therefore, results should be interpreted with caution, and future research should employ longer interventions and include delayed post-tests to better evaluate sustained learning outcomes.”
Comment 3:
"While cluster randomization was mentioned, the manuscript does not provide evidence of baseline equivalence between the experimental and control groups. Including a comparative analysis of pre-test results (e.g., means, standard deviations, and inferential statistics) would help establish group comparability prior to the intervention."
Response 3:
Thank you for this comment. We agree that demonstrating baseline equivalence is important. We have added a clarifying sentence in the Results section (3.1 Socio-demographic characteristics of participants) (page 10, paragraph 1) to indicate that no significant differences were found between groups at pre-test.
Revised text (addition in bold):
“…These descriptive characteristics contextualise the subsequent findings. A comparison of pre-test scores showed no significant differences between the experimental and control groups, indicating baseline equivalence prior to the intervention.”
Comment 4:
"The use of non-parametric tests (e.g., Wilcoxon, Mann–Whitney) is methodologically sound. However, the results are currently presented primarily in terms of statistical significance. To enhance interpretability and robustness, it is essential to report effect sizes, directionality of differences, and confidence intervals where applicable."
Response 4:
Thank you for this suggestion. We agree that reporting effect sizes and clarifying the direction of differences improves clarity. We have added one sentence in the Methods – Section 2.6 Data Analysis (page 9) and another in Results – Section 3.4 Vocabulary and content learning outcomes (page 12) to specify that effect sizes were calculated and to indicate which group performed better.
Revised text in Section 2.6 (addition in bold):
“…Missing data were handled via listwise deletion when ≤5% per analysis; otherwise, sensitivity checks were conducted. Effect sizes (r) were calculated for all non-parametric tests using the formula r = Z/√N, and the direction of differences was noted to clarify which group showed higher performance. Qualitative comments (when present) were used to contextualize quantitative trends.”
Revised text in Section 3.4 (addition in bold):
“…These results show that AR supported learning for certain items, with significant gains in tasks directly scaffolded by the digital overlay, while other tasks remained unchanged or showed no significant differences between groups. For the significant items, the experimental group (AR) outperformed the control group (print), indicating a positive effect of AR on targeted vocabulary and content acquisition.”
Comment 5:
"The findings indicate varied effects across different vocabulary items. The discussion would benefit from a more theoretically grounded interpretation. For example, drawing on Cognitive Load Theory or the CLIL 4Cs Framework—to hypothesize why AR facilitated learning for certain items but not others."
Response 5:
Thank you for this valuable suggestion. We agree and have added one sentence in the Discussion – Section 4.2 Relation to previous studies and theoretical framing (page 14, paragraph 1) to link our findings to Cognitive Load Theory and the CLIL 4Cs Framework.
Revised text (addition in bold):
“…From a CLIL perspective, this is consistent with the 4Cs framework: AR best supports the content–communication–cognition nexus when it renders disciplinary meanings perceptible and discussable in English. This also aligns with Cognitive Load Theory, as items directly scaffolded by AR likely reduced extraneous cognitive load, making complex concepts easier to process, while items without clear visual or spatial support may not have benefited to the same degree.”
Comment 6:
"The 30-day follow-up assessed only attitudes and usability, not knowledge retention. This should be explicitly stated as a limitation. Future research should incorporate delayed post-tests to better evaluate long-term retention of acquired knowledge."
Response 6:
Thank you for this comment. We agree and have combined this point with Comment 2 into a single, clear statement in Discussion – Section 4.4 Methodological considerations (page 15, paragraph 1) to avoid repetition.
Final sentence:
“It should be noted that the study involved a single 30-minute session and that the 30-day follow-up measured only attitudes and usability rather than knowledge retention; therefore, results should be interpreted with caution, and future research should employ longer interventions and include delayed post-tests to better evaluate sustained learning outcomes.”
Comment 7:
"The study focuses predominantly on student outcomes. Given the central role of teachers in CLIL environments, their perspectives—such as classroom management challenges, preparation load, and perceived pedagogical value of AR—warrant attention. It is recommended that future studies incorporate the teacher’s viewpoint to provide a more comprehensive understanding of AR integration in CLIL."
Response 7:
Thank you for this insightful suggestion. We agree that teacher perspectives are crucial for a comprehensive understanding of AR in CLIL. We have added one sentence in the Discussion – Section 4.5 Future research directions (page 16, paragraph 1) recommending the inclusion of teacher viewpoints in future studies.
Revised text (addition in bold):
“…(iv) incorporate process data to explain why specific items benefit [2,5–7,10] (and see [18,20] for cognitive load and task design considerations). Future research should also include teachers’ perspectives on classroom management, preparation demands, and perceived pedagogical value to better inform AR implementation in CLIL settings.”
Thank you for your recommendations.
The authors
Reviewer 2 Report
Comments and Suggestions for Authors
I found the proposal methodologically coherent. The item-by-item analyses allowed us to identify where AR helped most and where it made no difference. A pre-test, post-test, and 30-day post-test assessment were used. The statistical approach also proved coherent, as the scores were ordinal or binary. Nonparametric tests were used appropriately, avoiding the assumption of normality. However, the effect size for each test could be detailed.
The references are current, with few exceptions, and related to the article's theme. However, the article doesn't clearly explain how the articles were selected or the criteria used.
Regarding the overall reading of the article, which is a personal matter, I would have preferred the tables to be closer to the text. It would have been helpful to have a graph comparing the results and other statistical information across the tasks, showing what improved, what didn't, and what remained unchanged. Qualitative information was little explored in the article, being cited more in a summarized form to justify considerations and conclusions.
Although ease of use is mentioned and asserted, difficulties such as distances, usability, task difficulty, etc., were not explored much, nor were they associated with the tasks.
Despite the ease of using mobile devices, we know that the level of immersion is low, and in the future, it would be good to consider other devices to see if there are differences in results.
Other English tasks could be considered.
Other assessments could be considered, such as by item class. Propose another approach for the abstract tasks.
Present a more meaningful AR image with an associated task.
Author Response
Dear reviewer,
Thank you for your comments. Please find our responses below.
Comment 1:
"I found the proposal coherent methodologically. The item-by-item analyses allowed us to identify where AR helped most and made no difference. A pre-test, post-test, and 30-day post-test assessment were used. The statistical approach also proved coherent, as the scores were ordinal or binary. Nonparametric tests were used appropriately, avoiding the assumption of normality. However, the effect size for each test could be detailed."
Response 1:
Thank you for your positive assessment of the methodology and statistical approach. We agree with your suggestion to include the effect size for each test. Therefore, we have added information about how effect sizes were calculated and reported in the Data Analysis section.
This change appears on page 4, section 2.6, lines 12–16.
Revised text:
"Given the item-level and ordinal nature of several measures, nonparametric tests were planned (e.g., Wilcoxon signed-rank for paired changes; Mann–Whitney U for between-group comparisons), reporting Z statistics, exact/asymptotic p-values (α = .05), and effect sizes (r) calculated using the formula r = Z/√N to provide a better understanding of the magnitude of the observed effects."
Comment 2:
"The references are current, with few exceptions, and related to the article's theme. However, the article doesn't clearly explain how the articles were selected or the criteria."
Response 2:
Thank you for this observation. We have added a sentence explaining the process and criteria for selecting the references to clarify the scope and rationale for the literature review.
This addition appears in the Introduction section, page 2, paragraph 2, lines 9–12, after the sentence discussing the alignment of AR materials with learning objectives.
Revised text:
"References were identified through a targeted search of peer-reviewed articles published in the past ten years, focusing on augmented reality, CLIL, and vocabulary acquisition. Selection criteria included relevance to language education, methodological rigour, and empirical evidence of AR effectiveness in similar educational contexts."
Comment 3:
"Regarding the overall reading of the article, which is a personal matter, I would have preferred the tables to be closer to the text. It would have been helpful to have a graph to compare the results and other statistical information and the tasks, showing what improved, what didn't, and what remained unchanged."
Response 3:
Thank you for your suggestion. We agree that visual representation can be very helpful for readers. However, in accordance with the journal’s formatting guidelines and editorial instructions, we have kept the current table format and placement, as the final positioning of tables is determined by the editorial team during production.
To address your comment, we have clarified the description of the results in the text, making the key differences between tasks and outcomes more explicit, so that readers can easily follow which items improved, remained unchanged, or did not reach significance.
This clarification was added to page 6, paragraph 4, lines 3–7.
Revised text:
"These results show that AR supported learning for certain items, with significant gains in tasks directly scaffolded by the digital overlay, while other tasks remained unchanged or showed no significant differences between groups."
Comment 4:
"Qualitative information was little explored in the article, being cited more in a summarized form to justify considerations and conclusions. Despite being mentioned and the assertion of ease of use, difficulties such as distances, usability, task difficulty, etc., were not explored much, nor were they associated with the tasks."
Response 4:
Thank you for pointing this out. We have expanded the qualitative results section to include more detail on the challenges students faced while using the AR application, linking these difficulties to specific task types.
These changes were made in two places:
• Page 6, paragraph 3, lines 8–14 (Results),
• Page 8, paragraph 2, lines 2–5 (Discussion).
Revised text:
"Qualitative feedback indicated that while most students found the AR activities intuitive, some reported challenges related to usability, such as difficulties aligning the camera with markers and occasional lag. Others mentioned confusion in more complex tasks, particularly those requiring precise manipulation or spatial interpretation."
Comment 5:
"Despite the ease of using mobile devices, we know that the level of immersion is low, and in the future, it would be good to consider other devices to see if there are differences in results."
Response 5:
We acknowledge this limitation and have added a note in the Future Research section to highlight that different devices could produce varying results depending on the level of immersion they provide.
This sentence was inserted on page 9, paragraph 1, lines 3–6.
Revised text:
"Future research could also explore the use of higher-immersion devices, such as AR glasses or mixed reality headsets, to determine whether increased immersion leads to different or enhanced learning outcomes compared to mobile phones."
Comment 6:
"Other English tasks could be considered."
Response 6:
Thank you for this suggestion. We have acknowledged this in the Future Research section, noting that the current study was limited to vocabulary and content learning tasks.
This addition appears on page 9, paragraph 1, lines 7–9.
Revised text:
"Further studies could incorporate a wider range of English tasks, including writing and speaking activities, to assess whether AR can support different language skills beyond vocabulary and content acquisition."
Comment 7:
"Other assessments could be considered, such as by item class. Propose another approach for the abstract tasks. Present a more meaningful AR image with an associated task."
Response 7:
We appreciate this comment and have added two clarifications:
- A statement acknowledging that assessment instruments can be refined by grouping items by class and complexity.
- A plan to develop richer AR images that are more closely connected to abstract vocabulary items.
These additions are found in:
• Page 3, section 2.3 (Materials), end of the paragraph on Control Materials,
• Conclusions (future-oriented paragraph)
Revised text:
"In future iterations, assessment instruments will be refined to group items by class and complexity, allowing for a clearer understanding of which types of vocabulary and content benefit most from AR."
"The next version of the AR unit will include richer and more contextually meaningful 3D images that are directly linked to abstract vocabulary items, thereby enhancing the connection between visual elements and learning targets."
Thanks for your valuable comments.
The authors
Reviewer 3 Report
Comments and Suggestions for Authors
This study investigates whether augmented reality can enhance vocabulary acquisition and content learning in secondary CLIL science classrooms compared to traditional print-based instruction.
The paper is interesting. However, the results are quite general. The positive findings are emphasized, but non-significant outcomes are underreported.
In the abstract, I would suggest being more specific about the results and the difference between the two conditions.
In the introduction, the rationale for ecological validity and teacher uptake would be clearer if linked directly to the study’s research questions or hypotheses. I also suggest avoiding anticipatory results. Instead, highlight expectations or research aims without reporting outcomes.
In the Materials and Methods section, the description of participants should better explain whether intact classes were randomized or whether individuals were reassigned, since “cluster randomization” without stratification risks baseline imbalance. While the ethics section is comprehensive, the statement on “no approval code required” needs a brief justification of local policy.
The results section is too general. The reporting of statistical tests should include exact Z, p, and effect size values, not only significance claims. The comparison between EG and CG is too general; explicit test statistics and effect sizes are needed to support the claim that AR outperformed print. The summary of findings (lines 219–244) reads like a discussion (e.g., linking to prior studies) rather than a pure results section. It would be better to move the interpretive elements to the Discussion.
In the discussion, it would be important to emphasize the educational significance of the results. As regards the pedagogical implications, it would be useful to better link them to the study’s results and theoretical frameworks. Finally, the future research agenda could benefit from prioritization to give readers a clearer sense of next steps.
Author Response
Dear reviewer,
Thank you for your comments. Please find our responses below.
Comment 1:
"The paper is interesting. However, the results are quite general. The positive findings are emphasized, but non-significant outcomes are underreported."
Response 1:
Thank you for this observation. We agree that both significant and non-significant results should be clearly reported to provide a complete picture. We have revised the Results section (3.4 Vocabulary and content learning outcomes, page 12, paragraph 2) to explicitly mention the non-significant outcomes alongside the significant ones.
Revised text (addition in bold):
“…These results show that AR supported learning for certain items, with significant gains in tasks directly scaffolded by the digital overlay, while other tasks remained unchanged or showed no significant differences between groups. Specifically, items such as ‘To bring forward…,’ ‘This animal is…,’ and other low-visual-support prompts did not show significant pre/post improvement, indicating that AR’s benefits were limited to tasks where the visual component directly enhanced understanding.”
Comment 2:
"In the abstract, I would suggest being more specific about the results and the difference between the two conditions."
Response 2:
Thank you for this suggestion. We agree that the abstract should clearly indicate the key results and differences between the experimental group (AR) and the control group (print). Therefore, we have added a brief sentence to the Abstract (page 1, middle section) to highlight these findings more specifically without making the abstract too long.
Revised text (addition in bold):
“Results indicate that exposure to AR exerted a positive influence on learners’ engagement and supported learning processes, with perceptible shifts in students’ views of AR between baseline and post-intervention; nevertheless, effects were heterogeneous across instruments, items, and subgroups, suggesting that benefits accrued in a targeted rather than uniform fashion. Compared to the print-based group, students using AR demonstrated greater gains on visually supported vocabulary and content items, while other items showed no significant differences between groups.”
Comment 3:
"In the introduction, the rationale for ecological validity and teacher uptake would be clearer if linked directly to the study’s research questions or hypotheses. I also suggest avoiding anticipatory results. Instead, highlight expectations or research aims without reporting outcomes."
Response 3:
Thank you for this comment. We agree that the rationale should be directly tied to the study’s aims and that anticipatory results should be avoided. We have revised the final paragraph of the Introduction (page 4) to clearly connect ecological validity and teacher uptake to our research aims, and we have removed language that implied outcomes.
Revised text (changes in bold):
“…conditions chosen to maximize ecological validity and teacher uptake. The present study was designed to examine whether AR can support vocabulary and content learning under realistic classroom conditions while remaining feasible for teachers to implement. Our specific aim was to compare AR and print-based delivery of identical content and tasks to determine their effects on student learning and perceptions.”
Comment 4:
"In the Materials and Methods section, the description of participants should better explain whether intact classes were randomized or whether individuals were reassigned, since ‘cluster randomization’ without stratification risks baseline imbalance. While the ethics section is comprehensive, the statement on ‘no approval code required’ needs a brief justification of local policy."
Response 4:
Thank you for this helpful observation. We have made two clarifications in the Materials and Methods section:
- Participants subsection (2.2, page 8): Added a brief sentence clarifying that entire classes were used as clusters and randomly assigned, with no reassignment of individuals.
- Ethics subsection (2.7, page 10): Added a short justification explaining why no approval code was required under local institutional policy.
Revised text – Section 2.2 Participants (addition in bold):
“…Students were assigned to two groups—CG and EG—via cluster randomization assignment (simple allocation without prior stratification). Entire intact classes were treated as clusters and randomly assigned to either condition, with no reassignment of individual students. The study analyses AR’s impact on content and vocabulary learning and documents positive and negative aspects of AR use observed in this setting.”
Revised text – Section 2.7 Ethics (addition in bold):
“…No identifiable data were collected or reported, and all responses were anonymized. According to local institutional policy, minimal-risk classroom studies using only anonymized educational data do not require a formal approval code, which is why none was issued in this case. Thus, the research complied fully with institutional and international ethical standards.”
Comment 5:
"The results section is too general. The reporting of statistical tests should include exact Z, p, and effect size values, not only significance claims. The comparison between EG and CG is too general. Explicit test statistics and effect sizes to support the claim that AR outperformed print. The summary of findings (lines 219–244) reads like a discussion (e.g., linking to prior studies) rather than a pure results section. It would be better to move the interpretive elements to the Discussion."
Response 5:
Thank you for these observations. We have made two adjustments to improve clarity and accuracy in reporting:
- Results – Section 3.4 (page 12): Added a concise sentence confirming that exact Z and p values, along with effect sizes, are reported in Table 6 and briefly highlighted these results in the text.
- Results – Section 3.5 (page 13): Moved sentences that linked findings to prior studies to the Discussion section, ensuring this part now focuses only on reporting outcomes.
Revised text – Section 3.4 (addition in bold):
“…Taken together, these results indicate selective learning gains tied to the target items that were most saliently scaffolded during the AR-mediated tasks. Exact Z, p, and effect size values for each item are reported in Table 6 to provide transparency and to show where the experimental group (AR) significantly outperformed the control group (print).”
Revised text – Section 3.5 (simplified):
The sentences that linked to prior studies (lines 219–244) have been moved to Discussion – Section 4.1, so this section now contains only descriptive summaries of the data.
Comment 6:
"The discussion, it would be important to emphasize the educational significance of the results. As regards the pedagogical implications, it would be useful to better link them to the study’s results and theoretical frameworks."
Response 6:
Thank you for this suggestion. We agree that the discussion should highlight the educational significance of our findings and connect the pedagogical implications more clearly to our results and relevant theories. We have added a short sentence in Discussion – Section 4.3 Pedagogical implications for CLIL (page 14, paragraph 1) to explicitly link the implications to the study’s findings and theoretical frameworks.
Revised text (addition in bold):
“…Support teachers. Although ease-of-use ratings were high, lightweight training (marker handling, pacing, troubleshooting) is advisable, so classroom time remains cognitively productive [13–15,17]. These implications reflect our study’s findings that AR was most effective for visually scaffolded vocabulary and content, aligning with the CLIL 4Cs framework and Cognitive Load Theory by emphasizing meaningful, manageable, and integrated learning experiences.”
Thank you for your valuable comments.
The authors
Reviewer 4 Report
Comments and Suggestions for Authors
The article investigates the impact of augmented reality on the acquisition of English vocabulary and subject content within a secondary school CLIL course, based on a sample of 129 Spanish pupils, comparing an AR module with printed materials. The topic of the paper is relevant, as the integration of AR and mobile learning into language education is rapidly developing and requires evidence derived from ‘ordinary’ school settings. Overall assessment of structural compliance with MDPI: The structure of the article conforms to that accepted in MDPI for research articles (Abstract, Keywords, Introduction, Materials and Methods, Results, Discussion, Conclusions, References, Appendix). The level of English is acceptable. The article reads easily. The figures are of insufficient quality. The reference list requires revision: entry [24] is missing (the numbering jumps from [23] to [25]); DOI formatting and bibliographic details are inconsistent across items.
The following comments and recommendations can be formulated with respect to the paper:
- The scientific novelty lies mainly in the confirmation of well-established findings that AR supports motivation and selective gains in vocabulary. At the same time, the authors themselves emphasise the heterogeneity of the effects and the “targeted” nature of the benefits, which brings the study closer to replication research and diminishes the originality of the contribution (see the review and positioning of the results at the end of the Introduction and in the Conclusion).
- In the Instruments section, the pre-test is described as measuring familiarity with AR and school technologies, whereas the actual pre-test table contains linguistic tasks (“A peak is…”, “A predator is…”, etc.). This methodological inconsistency complicates the interpretation of changes and the validity of group comparisons. It is recommended to clearly distinguish between the technological pre-test and the linguistic pre-test/baseline in vocabulary, and to demonstrate the equivalence of groups prior to the intervention.
- A cluster randomisation “without stratification” was employed, yet no evidence of baseline group equality across key variables is provided. Furthermore, the intervention consisted of a single 30-minute session in both EG and CG, which makes the results vulnerable to novelty effects and random variation. It is recommended to: (i) add balancing/covariate checks for baseline differences, (ii) extend exposure or introduce a series of sessions, and (iii) describe measures to prevent contamination between clusters.
- The statistical analysis is reported as non-parametric, with Z-statistics and effect sizes r. However, the tables do not contain effect sizes or indicate the direction of the differences; in fact, the inter-group table explicitly states that “the direction of differences is not shown”. In addition, multiple tests were conducted across nine items without explicit correction for multiple comparisons and without reporting group means/medians. It is recommended to include effect sizes, confidence intervals, direction of effects (which group performed better), corrections such as Holm or Benjamini–Hochberg, and descriptive measures for EG/CG.
- The interpretation of the results goes beyond the data presented. In the Results and Conclusions sections it is claimed that students using AR “showed better retention and understanding”, yet no aggregated measures (e.g., total score, scale gain) or clear pre–post group comparisons with effect sizes are provided; the emphasis is instead placed on individual items with p-values. The final claims should be substantiated by full-scale analyses confirming “retention” (for instance, a delayed knowledge post-test rather than merely an attitudinal survey ≈30 days later).
- A delayed assessment of learning was, in fact, not conducted. The authors note that after ≈30 days they measured attitudes/convenience and “did not find a reliable signal of delayed learning”. For the claim of “retention”, a delayed knowledge post-test on the same items is required, not only an attitudinal survey.
- The quality and consistency of tables/figures require improvement. In the post-questionnaire table, several items are duplicated (e.g., items 1 and 15; 2 and 16 have identical wording), which appears to be a compositional artefact. In the AR-attitudes (pre) table, there is a zero variance (SD = 0.0) for the item on “required skills”, which is methodologically implausible and calls for verification of the raw data. In the pre-test table, there is a mixture of languages (“N Válido”, “N Perdido”) and inconsistencies between N and the contextual sample size. The figures in the appendix are described but resemble screenshots without scales and units; the text also contains colloquial references such as “pics 2–5”. It is recommended to unify the language, correct duplications, recalculate SD values, include raw medians/IQR, and improve the quality and captions of the figures.
- The experimental design is limited to a single 30-minute session on students’ personal devices; there is no description of control over distractors, variation in device performance, or network constraints. To strengthen ecological validity and transferability of the findings, the protocol of lesson orchestration (teacher steps, timing, rules for gadget use) should be described, alongside technical requirements.
- Practical value and implementation. A “standardised workflow on Zapworks SDK” is declared, yet no information is provided on licence costs, access models (online/offline), suitability for schools with limited resources, or teacher training requirements. An “Implementation details” section should be included, specifying authoring tools, development time, a teacher preparation checklist, and scenarios for “non-smartphone” contexts.
- Data transparency and reproducibility. The authors state that they will provide tables/screenshots/booklets in the Appendices and “Supplementary Tables 1–6”, yet the current version does not supply direct links (repository/DOI/OSF), and part of the material is only described textually. To enhance reproducibility, a public repository with anonymised data, analysis code, and a PDF version of the teaching booklet should be specified.
Author Response
Dear reviewer,
Thank you for your comments. Please find our responses below.
Comment 1:
"The scientific novelty lies mainly in the confirmation of well-established findings that AR supports motivation and selective gains in vocabulary. At the same time, the authors themselves emphasise the heterogeneity of the effects and the “targeted” nature of the benefits, which brings the study closer to replication research and diminishes the originality of the contribution (see the review and positioning of the results at the end of the Introduction and in the Conclusion)."
Response 1:
Thank you for this comment. We acknowledge that some of our findings confirm prior research. However, this study also extends the literature by testing AR under ordinary secondary CLIL classroom conditions, using a markerless, smartphone-based workflow, and directly comparing AR and print with identical tasks. We have added a clarifying sentence at the end of the Introduction (page 4, final paragraph) to highlight this contribution, while keeping the focus on ecological validity and feasibility.
Revised text – Section 1. Introduction (addition in bold):
“…Our specific aim was to compare AR and print-based delivery of identical content and tasks to determine their effects on student learning and perceptions. While our results confirm some well-established benefits of AR, the study adds value by demonstrating how AR can be integrated into regular secondary CLIL lessons using common smartphones, providing practical evidence for teachers and schools seeking scalable and realistic implementations.”
Comment 2:
"In the Instruments section, the pre-test is described as measuring familiarity with AR and school technologies, whereas the actual pre-test table contains linguistic tasks (“A peak is…”, “A predator is…”, etc.). This methodological inconsistency complicates the interpretation of changes and the validity of group comparisons. It is recommended to clearly distinguish between the technological pre-test and the linguistic pre-test/baseline in vocabulary, and to demonstrate the equivalence of groups prior to the intervention."
Response 2:
Thank you for pointing out this potential confusion. We agree that it is important to distinguish between the two types of pre-tests. We have revised Section 2.4 Instruments (page 8, paragraph 1) to clearly specify that two separate pre-tests were administered: one on technology familiarity and another on baseline vocabulary and content knowledge.
Additionally, we have already addressed group equivalence by adding a statement in Results Section 3.1 (page 10, paragraph 1) confirming no significant differences between groups at baseline.
Revised text – Section 2.4 Instruments (addition in bold):
“…Baseline survey (demographics and tech attitudes). A pre-survey collected gender, age, and other descriptors to characterize the participant corpus. Technology-related attitudes were analysed with the ARAAS scale to contextualize students’ baseline familiarity and dispositions. In addition, a separate linguistic pre-test was administered to assess students’ initial vocabulary and content knowledge for the CLIL unit. This allowed us to compare groups before the intervention and confirm that no significant baseline differences existed.”
Revised text – Section 3.1 Results:
“…These descriptive characteristics contextualise the subsequent findings. A comparison of pre-test scores showed no significant differences between the experimental and control groups, indicating baseline equivalence prior to the intervention.”
Comment 3:
"A cluster randomisation “without stratification” was employed, yet no evidence of baseline group equality across key variables is provided. Furthermore, the intervention consisted of a single 30-minute session in both EG and CG, which makes the results vulnerable to novelty effects and random variation. It is recommended to: (i) add balancing/covariate checks for baseline differences, (ii) extend exposure or introduce a series of sessions, and (iii) describe measures to prevent contamination between clusters."
Response 3:
Thank you for this helpful comment. We have addressed this in several ways:
- Baseline group equivalence (i): Already added in Results Section 3.1 (page 10, paragraph 1) — a sentence confirming no significant differences between the experimental and control groups before the intervention.
- Intervention duration (ii): Incorporated into Discussion Section 4.4 (page 15, paragraph 1) where we note the limitation of a single 30-minute session and recommend longer or repeated sessions for future studies.
- Cluster contamination prevention (iii): Added a short sentence in Methods Section 2.5 Procedure (page 9, paragraph 1) describing how classes were kept separate to avoid cross-condition contamination.
Revised text – Section 2.5 Procedure (addition in bold):
“…Students in the two groups completed the intervention separately, with different classrooms and time slots used to prevent cross-group contamination or sharing of materials between experimental and control clusters.”
Comment 4:
"The statistical analysis is reported as non-parametric, with Z-statistics and effect sizes r. However, the tables do not contain effect sizes or indicate the direction of the differences; in fact, the inter-group table explicitly states that “the direction of differences is not shown”. In addition, multiple tests were conducted across nine items without explicit correction for multiple comparisons and without reporting group means/medians. It is recommended to include effect sizes, confidence intervals, direction of effects (which group performed better), corrections such as Holm or Benjamini–Hochberg, and descriptive measures for EG/CG."
Response 4:
Thank you for this detailed feedback. We have made the following minimal revisions:
- Effect sizes and directionality: Added a new column for r (effect size) to Table 6 and updated the note to specify that negative Z values indicate that the AR group outperformed the print group.
- Multiple comparisons: Added a short sentence in Methods Section 2.6 Data Analysis (page 9, paragraph 1) to specify that the Holm correction was applied to control for multiple comparisons.
Revised text – Section 2.6 Data Analysis (addition in bold):
“…Effect sizes (r) were calculated for all non-parametric tests using the formula r = Z/√N, and the direction of differences was noted to clarify which group showed higher performance. A Holm correction was applied to control for the risk of Type I error due to multiple comparisons across the nine test items. Qualitative comments (when present) were used to contextualize quantitative trends.”
Updated Table 6 note:
Note. Z = standardized Mann–Whitney statistic; p = two-tailed p-value; r = effect size calculated as Z/√N. Significant differences are in bold. Negative Z values indicate higher performance by the Experimental Group (AR) compared to the Control Group (Print). A Holm correction was applied to control for multiple comparisons across the nine test items. Items 1, 5, 8, and 9 showed significant advantages for AR (p < .001). Items 2, 3, 4, 6, and 7 were non-significant (ps ≥ .10).
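To illustrate the correction named in the note, here is a minimal Python sketch of the Holm step-down rule applied to nine item-level p-values (the values are hypothetical placeholders chosen only to mirror the significance pattern described, not the manuscript's actual results):

import numpy as np

# Hypothetical two-tailed p-values for the nine test items.
p_values = np.array([0.0004, 0.21, 0.34, 0.47, 0.0008, 0.15, 0.10, 0.0002, 0.0009])
alpha = 0.05
m = len(p_values)

# Holm step-down: test p-values from smallest to largest against
# thresholds alpha/m, alpha/(m-1), ..., alpha/1, stopping at the first failure.
order = np.argsort(p_values)
rejected = np.zeros(m, dtype=bool)
for rank, idx in enumerate(order):
    if p_values[idx] <= alpha / (m - rank):
        rejected[idx] = True
    else:
        break

for item, (p, rej) in enumerate(zip(p_values, rejected), start=1):
    print(f"Item {item}: p = {p:.4f} -> {'significant' if rej else 'n.s.'} after Holm")

The same adjusted decisions can be obtained with multipletests(p_values, method='holm') from statsmodels, if that library is preferred.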
Comment 5
"The interpretation of the results goes beyond the data presented. In the Results and Conclusions sections it is claimed that students using AR “showed better retention and understanding”, yet no aggregated measures (e.g., total score, scale gain) or clear pre–post group comparisons with effect sizes are provided; the emphasis is instead placed on individual items with p-values. The final claims should be substantiated by full-scale analyses confirming “retention” (for instance, a delayed knowledge post-test rather than merely an attitudinal survey ≈30 days later)."
Response 5:
Thank you for this comment. We agree that our language in the Results and Conclusions should be more cautious to avoid overstating claims. We have revised the wording in both sections to clarify that our findings are based on item-level differences and immediate post-test results, and we now avoid using the term "retention" since a delayed knowledge post-test was not conducted.
Revised text – Results Section 3.4 (change in bold):
“…These results show that AR supported learning for certain items, with significant gains in tasks directly scaffolded by the digital overlay, while other tasks remained unchanged or showed no significant differences between groups. Thus, while AR facilitated better immediate performance on selected vocabulary and content items, these findings are limited to item-level outcomes and do not provide direct evidence of long-term retention.”
Revised text – Conclusions Section 5 (change in bold):
“…The experiment within the CLIL classroom demonstrated that students using AR showed higher immediate performance on selected vocabulary and content tasks compared to those who used traditional methods. However, because only immediate post-tests were conducted, these results should not be interpreted as evidence of long-term retention.”
Comment 6:
"A delayed assessment of learning was, in fact, not conducted. The authors note that after ≈30 days they measured attitudes/convenience and “did not find a reliable signal of delayed learning”. For the claim of “retention”, a delayed knowledge post-test on the same items is required, not only an attitudinal survey."
Response 6:
Thank you for this clarification. We agree that without a delayed knowledge post-test, we cannot claim to have measured retention. This point has been addressed through two revisions:
- Section 4.4 Methodological considerations (page 15): We added a sentence explaining that the 30-day follow-up measured only attitudes and usability.
- Section 3.4 Results (page 12): Added a final sentence stating explicitly that findings are limited to immediate post-test performance.
Revised text – Section 4.4 (final sentence):
“It should be noted that the study involved a single 30-minute session and that the 30-day follow-up measured only attitudes and usability rather than knowledge retention; therefore, results should be interpreted with caution, and future research should employ longer interventions and include delayed post-tests to better evaluate sustained learning outcomes.”
Comment 7:
"The quality and consistency of tables/figures require improvement. In the post-questionnaire table, several items are duplicated (e.g., items 1 and 15; 2 and 16 have identical wording), which appears to be a compositional artefact. In the AR-attitudes (pre) table, there is a zero variance (SD = 0.0) for the item on 'required skills', which is methodologically implausible and calls for verification of the raw data. In the pre-test table, there is a mixture of languages and inconsistencies between N and the contextual sample size. The figures in the appendix resemble screenshots without scales and units, and the text contains colloquial references such as 'pics 2–5'. It is recommended to unify the language, correct duplications, recalculate SD values, and improve the quality and captions of the figures.
Response 7:
We appreciate the reviewer’s careful reading and have made the following revisions to address these concerns:
- Post-questionnaire table: Items 15–18 are intentionally parallel for the control group to allow direct comparison with the AR group. This is now clearly explained in Section 2.4 and clarified in the table note.
- AR-attitudes (pre) table: The SD value of 0.0 for Item 13 was a formatting issue. It has been corrected to reflect the actual small variability (SD = 0.1) and the table note has been updated.
- Pre-test table: Labels were unified in English and the note now clearly explains why per-item N is slightly lower than the total sample due to missing responses.
- Figures and captions: Informal references were replaced with formal captions and clearer labels were added for readability.
Comment 8:
"The experimental design is limited to a single 30-minute session on students’ personal devices; there is no description of control over distractors, variation in device performance, or network constraints. To strengthen ecological validity and transferability of the findings, the protocol of lesson orchestration (teacher steps, timing, rules for gadget use) should be described, alongside technical requirements."
Response 8:
We agree that classroom orchestration is important. In our study, students completed the session using their own devices under teacher supervision, with clear rules to minimize distractions and ensure consistent participation. The session followed a fixed 30-minute structure with predefined steps and timing, and network connectivity was stable throughout. We have added a brief description of this protocol in the Methods section to clarify these points.
Addition (Section 2.5 Procedure):
"Students completed the session under teacher supervision using their own smartphones. A standard protocol was followed: device readiness was checked at the start, rules were given to avoid unrelated use, and tasks proceeded in a fixed 30-minute sequence. Classrooms were scheduled separately to prevent cross-group contamination, and network conditions were monitored to ensure smooth functioning."
Comment 9:
"Practical value and implementation. A “standardised workflow on Zapworks SDK” is declared, yet no information is provided on licence costs, access models (online/offline), suitability for schools with limited resources, or teacher training requirements. An “Implementation details” section should be included, specifying authoring tools, development time, a teacher preparation checklist, and scenarios for “non-smartphone” contexts."
Response 9:
We have added a brief clarification in the Methods to indicate that the software licenses were provided by a publicly funded research project and that no extra costs were incurred by schools or participants. The system runs on standard smartphones with basic connectivity and required only a short teacher orientation.
Addition (Section 2.3 Materials):
"Students used their own mobile devices (smartphones) in a standard classroom environment. The AR module was built using the Zapworks SDK, which operates on common smartphones with basic internet access. Licenses were provided by our publicly funded research project, with no additional cost to the school or participants. Teachers only required a brief orientation to use the system effectively."
Comment 10:
"Data transparency and reproducibility. The authors state that they will provide tables/screenshots/booklets in the Appendices and “Supplementary Tables 1–6”, yet the current version does not supply direct links (repository/DOI/OSF), and part of the material is only described textually. To enhance reproducibility, a public repository with anonymised data, analysis code, and a PDF version of the teaching booklet should be specified."
Response 10:
We appreciate the reviewer’s point on data transparency. Since this study involved minors and school settings, the data and full teaching materials will be made available after acceptance and through the Open Access Repository of our institution (RUA), following institutional and ethical guidelines. This ensures compliance with privacy regulations while still supporting reproducibility once the paper is officially published.
Thank you for your valuable comments.
The authors
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I think the manuscript can be published in its current form.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors,
I would like to thank you for your detailed response. I have checked the revised manuscript, and I have no additional comments.
Kind regards
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
I have carefully analysed the revised version of your manuscript and would like to present the following observations.
First of all, I wish to acknowledge that several key issues have been addressed. The Introduction and Conclusion have been expanded with a clearer articulation of the study’s novelty: the focus is now placed on testing AR technologies in the context of an ordinary secondary CLIL classroom using smartphones. This has strengthened the practical value of the paper and clarified its contribution. Revisions in the Instruments section have removed methodological ambiguities: it is now evident that two types of pre-tests were administered (technological and linguistic), and the Results confirm the absence of significant baseline differences between groups. In addition, measures to prevent cross-cluster contamination have been described, and the discussion now reflects on the need for longer interventions, thereby addressing concerns about the limitations of the design.
The statistical reporting has also been improved: effect sizes (r) and clarifications on the direction of differences have been added, together with a statement on the use of the Holm correction. These changes have rendered the analysis more transparent and methodologically sound. The authors have also moderated their claims regarding “retention” and have clarified that the findings are confined to immediate post-test results, thereby avoiding unwarranted generalisations.
In the Discussion and methodological considerations, it is now explicitly stated that the 30-day follow-up measured only attitudes and usability, rather than knowledge, which provides a more balanced interpretation of the outcomes. Although long-term effects remain beyond the scope of this study, the manuscript convincingly demonstrates the immediate positive outcomes of AR implementation in the classroom.
Finally, the technical and organisational aspects of the lesson are presented with greater clarity: teacher supervision, rules to minimise distraction, network stability, and the use of standard devices are all emphasised. Information on licensing and costs has also been added, which enhances the paper’s practical applicability. Minor inconsistencies in tables and figures have been corrected, terminology has been unified, and duplications have been removed.
