Article
Peer-Review Record

Can Generative Artificial Intelligence Outperform Self-Instructional Learning in Computer Programming?: Impact on Motivation and Knowledge Acquisition

Appl. Sci. 2025, 15(11), 5867; https://doi.org/10.3390/app15115867
by Rafael Mellado 1,* and Claudio Cubillos 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 4 March 2025 / Revised: 7 May 2025 / Accepted: 19 May 2025 / Published: 23 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents an insightful study comparing the use of Generative AI, specifically Microsoft Copilot, and instructional videos in a programming language course. The primary objective is to assess whether students feel more motivated using an AI tool or watching instructional videos. The findings suggest that students were less inclined to prefer the AI tool and that those who used Microsoft Copilot demonstrated lower learning outcomes.

The paper is well-written, and the results effectively address the research questions. However, several improvements could enhance clarity and strengthen the study’s contribution to the field.

The paper should clearly present examples of the diagnostic instruments used to assess student learning and motivation. It is unclear whether the instruments correspond to Figures 2, 3, and 5. Explicitly clarifying this would help the reader better understand the methodology.

It would be beneficial to clearly indicate which stages of the study Figures 2 and 3 correspond to. This would improve the flow of the explanation and ensure that readers can easily follow the experimental design.

The readability of inline code snippets could be improved by using a distinct font style or color. This would enhance clarity and distinguish code from the main text.

On page 11, there is a standalone term, "Instruments." It is unclear whether this serves as a subsection title or if it was misplaced. If it is a heading, it should follow the paper’s formatting conventions.

The paper should consider discussing whether Microsoft Copilot is the most appropriate AI tool for assisting students in learning programming. Other AI tools may be more user-friendly or better suited for educational contexts. Additionally, it is not specified whether students received prior training on using Microsoft Copilot or guidance on crafting effective prompts. This could have significantly influenced the results and should be addressed.

The conclusion suggests that Microsoft Copilot may be more beneficial for students with a strong conceptual foundation in programming. However, given that the participants had already taken other programming courses, it would be helpful to clarify why their foundations might still be insufficient. Further discussion on this point would strengthen the conclusions.

A significant portion of the references (over 40%) are more than five years old. Incorporating more recent literature would ensure the study is contextualized within the latest advancements in AI-assisted learning and programming education.

This study addresses an important and timely topic by evaluating different learning strategies in programming education. The comparison between AI-driven assistance and instructional videos provides valuable insights, and with the suggested refinements, the paper could offer an even stronger contribution to the field.

Author Response

Comments 1: The paper should clearly present examples of the diagnostic instruments used to assess student learning and motivation. It is unclear whether the instruments correspond to Figures 2, 3, and 5. Explicitly clarifying this would help the reader better understand the methodology.

Response 1: We appreciate your valuable comments and suggestions on improving the clarity and understanding of our manuscript. In response to your request to present clear examples of the diagnostic instruments used to evaluate student learning and motivation, we have integrated an additional paragraph in the paper that briefly describes a specific exercise used in the pretest and posttest. This exercise focuses on identifying and correcting errors in PHP code that interacts with a MySQL database, providing a concrete example of the types of questions and challenges students faced during the evaluation. Additionally, we have included Figure 2, which visually illustrates the described exercise, providing a clear reference for readers. We hope this addition clarifies the methodology employed and provides readers with a more comprehensive understanding of the instruments used in the study. Thank you again for your comments, and we remain attentive to any further suggestions that may improve our work.

 

Comments 2: It would be beneficial to clearly indicate which stages of the study Figures 2 and 3 correspond to. This would improve the flow of the explanation and ensure that readers can easily follow the experimental design.

Response 2: We agree that explicitly indicating to which stages of the study Figures 2 and 3 (or the relevant figures) correspond will improve the flow of the explanation and the understanding of the experimental design. In the revised version, we have updated the text in Section 3.3, "Process," to specify that Figure 1 corresponds to the general description of the four stages of the experimental process (initial explanation, diagnosis, intervention, and evaluation). Additionally, we have clarified that Figure 2 illustrates a specific example of the exercise used during Stage 3 (intervention) for the experimental group that employed MS Copilot. The changes are highlighted in yellow in the document.

 

Comments 3: The readability of inline code snippets could be improved by using a distinct font style or color. This would enhance clarity and distinguish code from the main text.

Response 3: Thank you for your comment regarding the readability of the code snippets. We recognize that distinguishing the code from the main text is essential for the clarity of the article. Given that the code snippets, such as those presented in Figure 2, are screenshots and not directly editable inline text that can be styled with a different font or color, we have opted to increase the size of the figures in the revised version. This adjustment enhances the readability of the code and text within the images, ensuring they are better differentiated from the main content. If the reviewer considers that further distinction is required (such as a border or background), we are open to implementing specific suggestions. Furthermore, we have enhanced the explanation of these images in the corresponding paragraphs.

 

Comments 4: On page 11, there is a standalone term, "Instruments." It is unclear whether this serves as a subsection title or if it was misplaced. If it is a heading, it should follow the paper’s formatting conventions.

Response 4: Indeed, it was intended to be a level 2 heading, but it was neither formatted nor numbered correctly in the original manuscript. In the revised version, we have corrected this by renaming it as '3.4 Instruments' and applying the level 2 heading format consistent with the rest of the article (numbering and boldface), in accordance with the formatting conventions of Applied Sciences. This eliminates any ambiguity and improves the structure of the text.

 

Comments 5: The paper should consider discussing whether Microsoft Copilot is the most appropriate AI tool for assisting students in learning programming. Other AI tools may be more user-friendly or better suited for educational contexts. Additionally, it is not specified whether students received prior training on using Microsoft Copilot or guidance on crafting effective prompts. This could have significantly influenced the results and should be addressed.

Response 5: Regarding the selection of MS Copilot, this tool was chosen because students have free access to it through their Office 365 accounts provided by the university, ensuring its availability and reducing access barriers within the context of our study. Additionally, MS Copilot offers robust functionalities for code generation and real-time feedback, aligning with the PHP programming learning objectives. While we recognize that other AI tools, such as ChatGPT or Google Gemini, might be more user-friendly or specifically designed for educational environments, the students' familiarity with MS Copilot (due to its use in previous iterations of the course) and its integration with the university ecosystem made it a practical and relevant choice. We have added a brief discussion about this in Section 3, 'Experimental Design,' to justify our choice and acknowledge alternatives.

As for prior training, no specific training or guidelines on formulating prompts for MS Copilot were provided in this study. This was because the participants, third-year industrial engineering students, had already used MS Copilot and other tools like Google Gemini in previous course activities and other academic contexts, giving them some prior experience. We considered this familiarity sufficient to minimize the need for additional guidance. However, we acknowledge that the lack of explicit training may have influenced the results, particularly the students' ability to fully leverage the tool. We have included this clarification in Section 3.3, 'Process,' and discuss its implications in Section 6, 'Conclusions,' as a limitation and a potential area for future research.

 

Comments 6: The conclusion suggests that Microsoft Copilot may be more beneficial for students with a strong conceptual foundation in programming. However, given that the participants had already taken other programming courses, it would be helpful to clarify why their foundations might still be insufficient. Further discussion on this point would strengthen the conclusions.

Response 6: We appreciate your comment regarding the need to clarify why the participants’ foundational knowledge might not have been sufficient for them to fully benefit from Microsoft Copilot (MS Copilot), despite having taken prior programming courses. We recognize that this apparent contradiction requires a more detailed explanation. The participants, third-year industrial engineering students, had completed previous modules in Java and databases, which provided them with a foundation in programming concepts. However, learning PHP—a new language for them in this course—combined with the lack of specific training in using MS Copilot (as detailed in the previous revision), may have limited their ability to effectively leverage the tool. Specifically, interacting with MS Copilot requires advanced metacognitive skills, such as formulating precise prompts and validating responses, which may not be fully developed in students at an intermediate stage of programming learning. We have expanded the discussion in Section 6, 'Conclusions,' to address this point, explaining that while the students had prior foundational knowledge, it may not have been sufficiently robust or specific to the context of PHP and the autonomous use of generative AI. This reinforces our suggestion that MS Copilot would be more beneficial in advanced stages, when students have a more consolidated foundation.

 

Comments 7: A significant portion of the references (over 40%) are more than five years old. Incorporating more recent literature would ensure the study is contextualized within the latest advancements in AI-assisted learning and programming education.

Response 7: We appreciate your observation regarding the age of a significant portion of the references. We acknowledge that more than 40% of the citations predate 2020, reflecting a reliance on foundational theories in technology-assisted learning (such as TAM, cognitive load theory, and HMSAM). However, we understand the importance of contextualizing our study within the most recent advances in generative AI and programming education. In the revised version, we have incorporated additional literature from the last five years (2020–2025), including studies on the impact of generative AI tools such as ChatGPT and Google Gemini in education, as well as recent research on pedagogical strategies in programming. These new references complement the existing theoretical citations without removing them, ensuring a balance between established foundations and current developments. The updated reference list now reflects a greater emphasis on recent literature, with approximately 60% of the citations originating from 2020 or later. We have added 11 new references in the field:

  1. Nettur, S. B., Karpurapu, S., Nettur, U., & Gajja, L. S. (2024). Cypress Copilot: Development of an AI Assistant for Boosting Productivity and Transforming Web Application Testing. IEEE Access. https://doi.org/10.1109/ACCESS.2024.3521407
  2. Mahadevappa, P., Muzammal, S. M., & Tayyab, M. (2025). Introduction to Generative AI in Web Engineering: Concepts and Applications. IGI Global, 297–330. https://doi.org/10.4018/979-8-3693-3703-5.CH015
  3. Jayachandran, D., Maldikar, P., Love, T. S., & Blum, J. J. (2024). Leveraging Generative Artificial Intelligence to Broaden Participation in Computer Science. Proceedings of the AAAI Symposium Series, 3(1), 486–492. https://doi.org/10.1609/AAAISS.V3I1.31262
  4. Ho, C. L., Liu, X. Y., Qiu, Y. W., & Yang, S. Y. (2024). Research on Innovative Applications and Impacts of Using Generative AI for User Interface Design in Programming Courses. ACM International Conference Proceeding Series, 68–72. https://doi.org/10.1145/3658549.3658566
  5. Huang, J., & Mizumoto, A. (2024). Examining the effect of generative AI on students’ motivation and writing self-efficacy. Digital Applied Linguistics, 1, 102324. https://doi.org/10.29140/DAL.V1.102324
  6. Krouska, A., Mylonas, P., Kabassi, K., Caro, J., Sgouropoulou, C., Hmoud, M., Swaity, H., Hamad, N., Karram, O., & Daher, W. (2024). Higher Education Students’ Task Motivation in the Generative Artificial Intelligence Context: The Case of ChatGPT. Information, 15(1), 33. https://doi.org/10.3390/INFO15010033
  7. Ghimire, A., & Edwards, J. (2024). Generative AI Adoption in Classroom in Context of Technology Acceptance Model (TAM) and the Innovation Diffusion Theory (IDT). https://arxiv.org/abs/2406.15360v1
  8. Lin, Z., & Ng, Y. L. (2024). Unraveling Gratifications, Concerns, and Acceptance of Generative Artificial Intelligence. International Journal of Human–Computer Interaction. https://doi.org/10.1080/10447318.2024.2436749
  9. Al-Abdullatif, A. M. (2024). Modeling Teachers’ Acceptance of Generative Artificial Intelligence Use in Higher Education: The Role of AI Literacy, Intelligent TPACK, and Perceived Trust. Education Sciences, 14(11), 1209. https://doi.org/10.3390/EDUCSCI14111209
  10. Chakraborty, S. (2024). Generative AI in Modern Education Society. Computers and Society. https://arxiv.org/abs/2412.08666v1
  11. Ko, S., & Chan, S. C. H. (n.d.). A Framework for the Responsible Integration of Generative AI Tools in Learning. IGI Global, 163–194. https://doi.org/10.4018/979-8-3373-1017-6.CH006


All changes are marked in yellow within the document.

Reviewer 2 Report

Comments and Suggestions for Authors
  1. The aim of the present paper is to evaluate the capacity of a GenAI (Microsoft Copilot) compared to an instructional video in learning programming (PHP), and generating positive emotional effects.
  2. The literature review reveals the main and most significant works for such a study. However, the theory of cognitive absorption can be understood better if supported by the theory of knowledge dynamics and the theory of knowledge fields (rational, emotional, and spiritual). Enjoyment, curiosity, immersion and final satisfaction are all based on the transformation capacity of each knowledge field to be transformed into another knowledge field during the process of learning. We recommend the authors to study from this point of view at least the following paper: Bratianu, C. & Bejinaru, R. (2020). Knowledge dynamics: A thermodynamic approach. Kybernetes, 49(1), 6-21. https://doi.org/10.1108/K-02-2019-0122.
  3. The authors used an empirical approach, with 71 university students in industrial engineering, within the framework of a programming course. They were not students in computer science and that explains why curiosity, enjoyment and immersion are so important.
  4. Findings should be understood within the given context and purpose. The authors should try to discuss the possibility of generalizing some of these findings to other courses or other study domains.
  5. The authors should underline that while GenAI applications are general tools for learning, the audiovisual tools used are specifically designed for learning such complex concepts and algorithms, and that explains their better outcomes.

Author Response

Comments 1: The literature review reveals the main and most significant works for such a study. However, the theory of cognitive absorption can be understood better if supported by the theory of knowledge dynamics and the theory of knowledge fields (rational, emotional, and spiritual). Enjoyment, curiosity, immersion and final satisfaction are all based on the transformation capacity of each knowledge field to be transformed into another knowledge field during the process of learning. We recommend the authors to study from this point of view at least the following paper: Bratianu, C. & Bejinaru, R. (2020). Knowledge dynamics: A thermodynamic approach. Kybernetes, 49(1), 6-21. https://doi.org/10.1108/K-02-2019-0122.

Response 1: We thank the reviewer for their valuable suggestion to enrich the understanding of cognitive absorption theory by integrating the theories of knowledge dynamics and knowledge fields (rational, emotional, and spiritual). We recognize that these theories offer a complementary perspective on how enjoyment, curiosity, immersion, and ultimate satisfaction emerge from the transformation of knowledge across different fields during the learning process. To address this observation, we have revised Section 1.2 (Theoretical Framework) of the manuscript, incorporating a brief discussion on how knowledge dynamics and knowledge fields can underpin the cognitive absorption constructs of the HMSAM model. Specifically, we have added an explanation of how the transformation of rational knowledge (e.g., understanding programming concepts) into emotional knowledge (e.g., enjoyment or curiosity in problem-solving) can amplify the learning experience with tools such as Microsoft Copilot and instructional videos. This integration strengthens the theoretical framework without deviating from the main focus of the study, which is to empirically compare the impact of both methodologies on programming education. We hope that this revision adequately addresses your suggestion and enhances the theoretical clarity of the article. The changes are highlighted in yellow. The new bibliography added is:

  1. Bratianu, C., & Bejinaru, R. (2020). Knowledge dynamics: A thermodynamic approach. Kybernetes, 49(1), 6–21. https://doi.org/10.1108/K-02-2019-0122
  2. Bratianu, C., & Garcia-Perez, A. (2023). Knowledge Dynamics and Expert Knowledge Translation: A Case Study. European Conference on Knowledge Management, 24(1), 140–147. https://doi.org/10.34190/ECKM.24.1.1382
  3. Qadhi, S. (2023). Knowledge Dynamics: Educational Pathways from Theories to Tangible Outcomes. In From Theory of Knowledge Management to Practice. https://doi.org/10.5772/INTECHOPEN.1002979

 

Comments 2: The authors used an empirical approach, with 71 university students in industrial engineering, within the framework of a programming course. They were not students in computer science and that explains why curiosity, enjoyment and immersion are so important.

Response 2: We appreciate the reviewer’s insightful observation regarding the empirical approach involving 71 industrial engineering students, rather than computer science students, and its potential influence on the significance of curiosity, enjoyment, and immersion in the context of a programming course. We agree that the participants’ academic background likely shaped their learning experience, as industrial engineering students may have less prior exposure to programming compared to computer science students, making affective factors such as curiosity, enjoyment, and immersion particularly critical for their engagement and success. To address this point, we have revised Section 5.1 (Learning Effects) and Section 6 (Conclusions) to explicitly acknowledge the role of the participants’ profile in interpreting the results. Specifically, we have added a discussion on how the limited programming experience of industrial engineering students may amplify the importance of these affective dimensions, and we note that the findings might differ with computer science students who possess stronger foundational skills. This clarification enhances the contextualization of our findings and underscores the relevance of tailoring instructional strategies to learners’ backgrounds. We believe this revision strengthens the manuscript by aligning the interpretation of the results with the participants’ characteristics.

 

Comments 3: Findings should be understood within the given context and purpose. The authors should try to discuss the possibility of generalizing some of these findings to other courses or other study domains.

Response 3: We thank the reviewer for highlighting the importance of contextualizing our findings and exploring their potential generalizability to other courses or study domains. We agree that while our results are grounded in the specific context of industrial engineering students learning PHP, discussing their broader applicability enhances the study’s relevance. To address this suggestion, we have revised Section 6 (Conclusions) by adding a discussion on the possibility of generalizing the findings to other programming courses and study domains beyond programming. We propose that the effectiveness of instructional videos and generative AI tools like MS Copilot may extend to contexts where structured guidance and reduced cognitive load are critical for novices, while acknowledging that differences in learner profiles, course content, and disciplinary demands could modulate these effects. Additionally, we have expanded the limitations and future research directions to emphasize the need for further studies across diverse educational settings and domains to validate such generalizations. We believe this revision provides a balanced perspective on the scope of our findings and their potential implications.

 

Comments 4: The authors should underline that while GenAI applications are general tools for learning, the audiovisual tools used are specifically designed for learning such complex concepts and algorithms, and that explains their better outcomes.

Response 4: We appreciate the reviewer’s suggestion to emphasize the distinction between the general-purpose nature of generative AI (GenAI) tools and the specifically designed audiovisual tools used in this study, as a key factor explaining their differing outcomes. We agree that while GenAI applications like Microsoft Copilot serve as versatile tools across various domains, the instructional videos were purposefully crafted to teach complex programming concepts and algorithms in PHP, which likely contributed to their superior performance in this context. To address this observation, we have revised Section 5.1 (Learning Effects) to explicitly highlight that the pedagogical design of the videos—tailored to reduce cognitive load and provide structured explanations—offers an advantage over the more general functionality of MS Copilot. Additionally, we have reinforced this point in Section 6 (Conclusions) to clarify how the intentional design of audiovisual tools aligns with their effectiveness for novice learners. We believe this revision strengthens the interpretation of our findings by underscoring the role of tool-specific design in educational outcomes.

All changes are marked in yellow within the document.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript investigates whether a generative AI tool such as Microsoft Copilot can outperform instructional videos in teaching PHP programming to university students. The study employs a quasi-experimental design with pre-tests and post-tests along with HMSAM-based measures to compare learning outcomes and affective responses. However, there are several critical methodological issues that call into question the reliability of the findings.

 

1 The authors do not provide sufficient details on the randomization process used to assign students to the two groups. It is unclear how potential pre-existing differences were controlled beyond using a pre-test.

 

2 The intervention appears to be a one-time session with no follow-up assessments. The authors should discuss how the short duration might affect the validity of the learning outcomes.

 

3 The pre-test and post-test instruments used to assess PHP knowledge lack information on their validation and reliability. The authors need to clarify how these tests were developed and whether they have been previously validated.

 

4 The description of the MS Copilot intervention is vague. The authors should specify how students were trained to interact with the AI tool and ensure that its usage was standardized across participants.

 

5 The study does not account for individual differences in digital literacy or prior exposure to AI tools. The authors should address how these factors may have influenced the results and consider them as potential confounding variables.

 

6 The statistical analysis raises concerns about the assumptions underlying ANCOVA and ANOVA. The authors should provide evidence of tests for normality and homogeneity of variance to support their use of these methods.

 

7 The rationale for selecting the specific HMSAM constructs and their operationalization in the study is not clearly justified. The authors need to explain how each construct is relevant to the context of programming education.

 

8 The claim that instructional videos are more effective is based solely on immediate post-test results. The authors should temper their conclusions given the absence of longitudinal data to support long-term learning benefits.

 

9 The potential influence of differences in the quality and structure of the instructional content between the video and AI groups is not addressed. The authors need to discuss whether variations in content delivery might have contributed to the observed differences.

 

10 The study focuses exclusively on PHP and does not discuss how the choice of programming language may limit the generalizability of the findings. The authors should acknowledge this limitation and consider its implications.

 

11 The integration of follow-up questions in the MS Copilot group is not clearly explained. The authors must detail how these questions were incorporated into the intervention and how they may have affected the learning process.

  • The authors do not explain the randomization procedure in enough detail to confirm that both groups had balanced baseline characteristics. How were participants assigned to the two interventions and how can the authors ensure that this process minimized differences unrelated to the treatment?
  • The authors use a one-shot intervention with immediate post-tests. The short duration and lack of any delayed assessment raise questions about whether the findings reflect stable or merely short-term effects on learning.
  • The development and validation process for the pre-test and post-test instruments is not described. The authors should clarify how they ensured that the knowledge tests accurately measured PHP proficiency and possessed reliable psychometric properties.
  • The description of the MS Copilot group is not detailed enough to confirm consistent use of the AI tool. The authors should discuss how each student was guided in using Copilot, how many queries they made, and whether they had prior practice with AI-driven tools.
  • The manuscript does not report controlling for differences in digital literacy, which may influence how quickly students adapt to AI tools. The authors could measure or acknowledge these potential confounders and explore how they affect the study results.
  • The ANCOVA and ANOVA methods require assumptions such as normality and homogeneity of variance. The authors should report tests of these assumptions and discuss whether the data met the criteria for valid parametric analysis.
  • The rationale behind the choice of the specific HMSAM variables and how they apply to learning to program in PHP is not fully articulated. The authors should explain how constructs like temporal dissociation or immersion specifically apply to this programming context.
  • The claim that instructional videos are more effective may be overstated because there is no evidence about longer-term knowledge retention or continued skill development. The authors are encouraged to discuss potential limitations of short-term assessments.
  • There is little discussion of how the content in the instructional video group was structured and whether it was more straightforward or thorough than the guidance given by Copilot. The authors might compare the nature or structure of both interventions to determine whether this influenced the results.
  • The choice of PHP as the sole language in this study could limit generalization to other programming contexts. The authors should acknowledge that some of the findings might differ if the language or complexity level changed.
  • The MS Copilot group was given follow-up questions, but the integration of these prompts into the overall learning process is not fully explained. The authors should clarify how these guided questions were used and how they might have influenced the outcomes.
 

Author Response

Comments 1: The authors do not provide sufficient details on the randomization process used to assign students to the two groups. It is unclear how potential pre-existing differences were controlled beyond using a pre-test.

Response 1: We appreciate the reviewer’s comment, as it highlights an opportunity to provide greater clarity on our experimental design. To address this concern, we have revised Section 3.1 ("Participants") of the manuscript to include a more detailed description of the randomization process. Specifically, the 71 students were assigned to the two groups (35 in the MS Copilot group and 36 in the instructional video group) using the random group assignment feature available in the Moodle platform, which automatically distributes participants into groups randomly. This process was conducted before the intervention, ensuring an unbiased allocation of students to either the MS Copilot or video condition, with each participant having an equal chance of assignment. Regarding controlling potential pre-existing differences, we relied primarily on the pre-test to establish a baseline of PHP knowledge across both groups, as reported in Section 4.1 ("Learning Effects"). The pre-test results, analyzed via ANOVA, showed no significant differences in prior knowledge between the groups (F(1,69) = 0.451, p = 0.50), suggesting that the randomization effectively balanced initial PHP proficiency. While additional demographic or academic variables (e.g., age, gender, or prior course performance) were not analyzed due to the scope of this study and resource constraints, the random assignment via Moodle and the pre-test equivalence provide reasonable assurance that pre-existing differences were adequately controlled for this quasi-experimental design. We have clarified these points in the revised manuscript and acknowledge in Section 6 ("Conclusions") that future studies could benefit from examining additional variables to validate group comparability further. We believe these revisions address the reviewer’s concern by enhancing transparency about the randomization process and the approach to controlling pre-existing differences.
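A minimal sketch of such a baseline-equivalence check, a one-way ANOVA on pre-test scores for two groups, is shown below; the scores are simulated placeholders rather than the study's data, and the group sizes simply mirror those reported (35 and 36).

```python
# Minimal sketch of a baseline-equivalence check via one-way ANOVA.
# All data here are simulated placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pretest_copilot = rng.normal(loc=4.0, scale=1.5, size=35)  # hypothetical pre-test scores
pretest_video = rng.normal(loc=4.1, scale=1.5, size=36)    # hypothetical pre-test scores

# With two groups, a one-way ANOVA is equivalent to an independent-samples t-test.
f_stat, p_value = stats.f_oneway(pretest_copilot, pretest_video)
df_error = len(pretest_copilot) + len(pretest_video) - 2
print(f"F(1,{df_error}) = {f_stat:.3f}, p = {p_value:.3f}")
# A non-significant result (p > 0.05) is read as consistent with balanced baseline
# knowledge, which is how the reported F(1,69) = 0.451, p = 0.50 is interpreted.
```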

 

Comments 2: The intervention appears to be a one-time session with no follow-up assessments. The authors should discuss how the short duration might affect the validity of the learning outcomes.

Response 2: To clarify, the intervention consisted of a single activity spanning seven days, during which students completed the PHP practice guide and final assessments described in Section 3.3 ("Process"). No follow-up assessments were conducted beyond the immediate post-test. To address this observation, we have expanded Section 6 ("Conclusions") in the revised manuscript to include a discussion on how the intervention's short duration might influence the validity of the learning outcomes.

In this revision, we acknowledge that the one-time nature of the intervention primarily captures short-term learning gains and may not fully reflect long-term retention or skill application. We discuss how this limitation could affect the external validity of the results, particularly in terms of generalizing the findings to sustained programming proficiency. However, we also note that the study’s focus was to compare the immediate effectiveness of MS Copilot versus instructional videos in a controlled setting, and the quasi-experimental design, supported by pre- and post-tests, provides a valid snapshot of learning outcomes within this timeframe. We further suggest that future research should incorporate longitudinal assessments to evaluate the durability of these effects over time.

 

Comments 3: The pre-test and post-test instruments used to assess PHP knowledge lack information on their validation and reliability. The authors need to clarify how these tests were developed and whether they have been previously validated.

Response 3: To address this, we have revised Section 3.4 ("Instruments") to clarify that these tests were developed based on a pool of 18 exercises aligned with the course’s PHP learning objectives, previously used in the software systems course over multiple semesters. While not formally validated in prior studies, their content validity was ensured through expert review by two instructors with over five years of PHP teaching experience. Reliability was assessed post-hoc using Cronbach’s alpha on the post-test scores, yielding a value of 0.82, indicating good internal consistency. These details have been added to enhance transparency, and we acknowledge in Section 6 that future studies could further validate these instruments.
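A post-hoc Cronbach's alpha of the kind described can be computed directly from an item-score matrix. The sketch below is illustrative only: it uses simulated item scores for 71 respondents by 18 items (mirroring the counts mentioned in the responses), not the study's data.

```python
# Illustrative Cronbach's alpha computation on a respondents-by-items matrix.
# The item scores are simulated; the study itself reports alpha = 0.82.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = respondents and columns = test items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
ability = rng.normal(size=(71, 1))                        # latent proficiency (hypothetical)
scores = ability + rng.normal(scale=1.0, size=(71, 18))   # correlated item scores
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```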

 

Comments 4: The description of the MS Copilot intervention is vague. The authors should specify how students were trained to interact with the AI tool and ensure that its usage was standardized across participants.

Response 4: Although participants did not receive explicit, formal training or a standardized interaction protocol immediately before the intervention, students' previous experiences provided a relevant baseline. Specifically, students had prior exposure to generative artificial intelligence tools (e.g., ChatGPT and Google Gemini) and, importantly, direct familiarity with Microsoft Copilot from earlier course activities conducted within the same semester. This previous exposure included informal experiences during workshops, classroom demonstrations, and assignments in which students frequently consulted generative AI tools to support coding tasks, error-checking, and troubleshooting in contexts related to Java programming and databases. Thus, when approaching the current PHP exercises, participants already possessed functional knowledge on formulating queries, evaluating AI-generated responses, and correcting outputs provided by MS Copilot. Despite this prior familiarity, we acknowledge that the absence of explicit instructions or formal standardization protocols at the beginning of the intervention could have introduced variability in how students interacted with the tool. We have clarified this situation in the manuscript, transparently describing students’ prior experience and clearly indicating the potential methodological limitation. Additionally, we discuss the implications of this limitation and offer recommendations for future research to address it more rigorously.

 

Comments 5: The study does not account for individual differences in digital literacy or prior exposure to AI tools. The authors should address how these factors may have influenced the results and consider them as potential confounding variables.

Response 5: We agree that individual differences in digital literacy and prior exposure to AI tools are relevant factors that could influence interactions and learning outcomes. In this study, we did not explicitly measure digital literacy or quantify prior exposure to AI tools separately. However, we administered a pretest focused on baseline PHP programming knowledge and analyzed these pretest scores between experimental groups. The results indicated no statistically significant differences, thus demonstrating initial equivalence in programming knowledge before the intervention. Despite this initial equivalence in programming knowledge, we acknowledge that the pretest did not directly capture participants' levels of digital literacy or quantify their familiarity with generative AI tools beyond assuming informal prior exposure from earlier course activities. Consequently, variability in these factors may have contributed to differences in how effectively students interacted with MS Copilot, potentially influencing performance and perceptions of usefulness.  To address this issue transparently, we have explicitly recognized the absence of explicit measurements of digital literacy and prior AI exposure as a methodological limitation. We also provide an extended discussion on how future studies could systematically measure and control these dimensions to improve the validity and generalizability of findings.

 

Comments 6: The statistical analysis raises concerns about the assumptions underlying ANCOVA and ANOVA. The authors should provide evidence of tests for normality and homogeneity of variance to support their use of these methods.

Response 6: We appreciate your feedback regarding the statistical analysis. In response to your concern about the assumptions underlying ANCOVA and ANOVA, we have added evidence of tests for normality and homogeneity of variance in the manuscript's Results section. Specifically, we included the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance to ensure the validity of these statistical methods. These additions provide the necessary support for using ANCOVA and ANOVA in our analysis, and we hope they address your concerns. Thank you for bringing this to our attention.
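The two assumption checks named here are standard library calls; a brief sketch with hypothetical data is given below (group labels and scores are placeholders, not the study's data).

```python
# Sketch of the assumption checks mentioned above: Shapiro-Wilk per group for
# normality and Levene's test for homogeneity of variance. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
post_copilot = rng.normal(loc=5.5, scale=2.3, size=35)  # hypothetical post-test scores
post_video = rng.normal(loc=6.8, scale=1.2, size=36)    # hypothetical post-test scores

for label, group in (("Copilot", post_copilot), ("Video", post_video)):
    w, p = stats.shapiro(group)
    print(f"Shapiro-Wilk ({label}): W = {w:.3f}, p = {p:.3f}")

w, p = stats.levene(post_copilot, post_video, center="median")
print(f"Levene's test: W = {w:.3f}, p = {p:.3f}")
# Non-significant results on both checks are usually taken as compatible with
# the normality and equal-variance assumptions behind ANOVA/ANCOVA.
```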

 

Comments 7: The rationale for selecting the specific HMSAM constructs and their operationalization in the study is not clearly justified. The authors need to explain how each construct is relevant to the context of programming education.

Response 7: While the manuscript presents the general structure of the Hedonic-Motivation System Adoption Model (HMSAM), we acknowledge that the relevance of each specific construct within the context of programming education—particularly with generative AI tools—was not fully elaborated.

To address this, we have now explicitly justified the inclusion of each of the HMSAM constructs (usefulness, ease of use, behavioral intention to use, enjoyment, control, focused immersion, temporal dissociation, and curiosity) based on their empirical and theoretical alignment with the motivational and cognitive dynamics involved in learning computer programming. For instance, constructs like control and focused immersion are particularly salient given the mental demands of debugging and logical problem-solving; enjoyment and curiosity are key for sustaining engagement in an effortful task like programming; while perceived usefulness and intention to use capture the student's perception of technological value, which is critical when integrating AI tools into educational contexts. We have integrated this extended justification into the theoretical framework section to clarify how each dimension of HMSAM contributes to understanding students’ experiences in programming tasks supported by AI or instructional videos.

 

Comments 8: The claim that instructional videos are more effective is based solely on immediate post-test results. The authors should temper their conclusions given the absence of longitudinal data to support long-term learning benefits.

Response 8: We agree with the reviewer that these findings should not be generalized to long-term learning effects without further empirical evidence. As such, we have revised the relevant sections of the manuscript to temper our conclusions and explicitly acknowledge the limitations associated with the lack of longitudinal data. Furthermore, we now emphasize the need for future research to include delayed post-tests or follow-up assessments to evaluate retention, transfer of knowledge, and the sustained impact of instructional methods over time.

 

Comments 9: The potential influence of differences in the quality and structure of the instructional content between the video and AI groups is not addressed. The authors need to discuss whether variations in content delivery might have contributed to the observed differences.

Response 9: While both groups worked with the same set of PHP programming exercises, we acknowledge that the materials' structure, consistency, and instructional quality differed substantially between conditions. The instructional video was a carefully scripted and pre-recorded resource, designed to provide systematic explanations and pedagogically sequenced feedback. In contrast, students in the MS Copilot group received AI-generated responses that varied depending on the phrasing of their prompts, the timing of queries, and their own interpretative choices.

As such, we agree that differences in content delivery—particularly the higher degree of structure, clarity, and instructional intent embedded in the video—may have contributed to the superior outcomes observed in that group. We have addressed this point in the Discussion section, framing it as a plausible explanatory factor and suggesting that future studies control for instructional structure when comparing pedagogical modalities.

 

Comments 10: The study focuses exclusively on PHP and does not discuss how the choice of programming language may limit the generalizability of the findings. The authors should acknowledge this limitation and consider its implications.

Response 10: We thank the reviewer for this valuable comment. Indeed, the exclusive focus on PHP represents a potential limitation in terms of generalizability. While PHP was selected due to its relevance in the course curriculum and widespread use in web programming, we recognize that different programming languages present distinct syntactic, semantic, and cognitive challenges, which may interact differently with instructional modalities such as generative AI or instructional videos. For example, languages like Python or JavaScript may offer different levels of abstraction, error tolerance, or readability, which could influence how students benefit from AI-generated support or video-based instruction. We have now explicitly acknowledged this limitation in the Discussion section and emphasized the need for future research to replicate the study across different programming languages and paradigms to assess the consistency of our findings.

 

Comments 11: The integration of follow-up questions in the MS Copilot group is not clearly explained. The authors must detail how these questions were incorporated into the intervention and how they may have affected the learning process.

Response 11: We thank the reviewer for raising this important point. Indeed, the MS Copilot group’s instructional material included follow-up questions embedded directly into the exercise guide, as illustrated in Figure 2 of the manuscript. These questions were designed to scaffold critical thinking and reduce passive acceptance of AI-generated responses. For example, after querying Copilot for assistance with PHP syntax or logic, students encountered prompts such as: "Did the assistant take into account that the variable $objetos is outside the PHP block?". These questions encouraged students to verify the accuracy and contextual relevance of the AI outputs. They served as cognitive prompts to help students reflect on the correctness of the code and, when necessary, reformulate their queries. We have now expanded the explanation of this element in the Process section and discussed its pedagogical function and potential impact variability in the Discussion section.

 

Comments 12: The authors do not explain the randomization procedure in enough detail to confirm that both groups had balanced baseline characteristics. How were participants assigned to the two interventions and how can the authors ensure that this process minimized differences unrelated to the treatment?

Response 12: We thank the reviewer for highlighting the need for a more detailed explanation of the randomization procedure. Based on previous observations from you and other reviewers, it was added to the manuscript (Section 3.1, Participants) that randomization was performed using the Moodle platform. However, to enhance clarity and address the concern regarding group balance, we have expanded the description of the randomization process and included information supporting the equivalence of baseline characteristics. Specifically, participants were assigned randomly through the Moodle "random group assignment" feature, which ensures that all enrolled students have an equal probability of being assigned to any group. This method is automated, unbiased, and executed before the intervention starts. We also clarify that other demographic or academic variables were not considered, as the scope of the study focused on the equivalence in programming knowledge. We have now included this additional clarification in the manuscript to make the randomization and equivalence procedures more explicit.

 

Comments 13: The authors use a one-shot intervention with immediate post-tests. The short duration and lack of any delayed assessment raise questions about whether the findings reflect stable or merely short-term effects on learning.

Response 13: We thank the reviewer for raising this important concern regarding the short duration of the intervention and the lack of a delayed assessment. Indeed, another reviewer noted the limitation associated with the one-shot post-test design during a previous round of evaluation. As a result, we revised the Discussion section accordingly to acknowledge this limitation explicitly and discuss its implications for interpreting our findings. Specifically, we now state that the results reflect short-term learning effects only and that no conclusions can be drawn regarding long-term retention or stability of knowledge gains. We also emphasize the importance of incorporating delayed post-tests in future studies, especially when evaluating instructional methods involving generative AI or video-based learning, as both may have differing long-term impacts on cognitive retention and metacognitive engagement. By addressing both reviewers’ concerns jointly, we have aimed to provide a transparent and critical reflection on the study’s temporal scope and methodological constraints. The relevant paragraph has been added to the Discussion section.

 

Comments 14: The development and validation process for the pre-test and post-test instruments is not described. The authors should clarify how they ensured that the knowledge tests accurately measured PHP proficiency and possessed reliable psychometric properties.

Response 14: Another reviewer also highlighted this point in an earlier round, and we addressed it by expanding Section 3.4 (Instruments). We now explain that the knowledge tests were constructed from a pool of PHP programming exercises used in previous semesters of the course, ensuring alignment with the module’s learning objectives. Content validity was established through review by two instructors with over five years of experience teaching PHP. Furthermore, we conducted a post-hoc reliability analysis, reporting a Cronbach’s alpha of 0.82, which indicates strong internal consistency. This level of detail appropriately demonstrates that the knowledge tests are content-valid and psychometrically reliable measures of PHP proficiency.

 

Comments 15: The description of the MS Copilot group is not detailed enough to confirm consistent use of the AI tool. The authors should discuss how each student was guided in using Copilot, how many queries they made, and whether they had prior practice with AI-driven tools.

Response 15: We thank the reviewer for raising this important point regarding the consistency of MS Copilot usage across participants. We have addressed this concern in Section 3.3 (Process), where we clarify that students in the Copilot group had prior informal experience using generative AI tools—including MS Copilot—through their institutional Office 365 accounts. While no formal training was delivered immediately before the intervention, the instructional material included structured follow-up questions for each exercise. These questions were designed to prompt students to evaluate the correctness of AI-generated code and encourage critical thinking. However, we acknowledge that we did not systematically record the number of queries or interactions each student had with MS Copilot during the activity. This is a limitation of the study, and we now include a brief note in the Discussion section suggesting that future research should incorporate usage tracking or interaction logs to better assess consistency and engagement levels.

 

Comments 16: The manuscript does not report controlling for differences in digital literacy, which may influence how quickly students adapt to AI tools. The authors could measure or acknowledge these potential confounders and explore how they affect the study results.

Response 16: We thank the reviewer for highlighting the potentially confounding role of digital literacy in shaping students’ interaction with AI tools. This issue was considered during the revision process and is now explicitly addressed in the Discussion section. We acknowledge that, although both groups were balanced in terms of prior programming knowledge—as verified through the pre-test—no direct measurement of digital literacy or previous experience with generative AI tools was conducted. To account for this limitation, we discuss how students with similar technical knowledge might still differ significantly in their ability to engage with AI-based tools like MS Copilot due to varying levels of digital fluency. These differences could influence how effectively students formulate queries, interpret AI-generated outputs, or identify inaccuracies—factors that may have introduced uncontrolled variance in the experimental condition. We also recommend that future studies incorporate explicit assessments of digital fluency and possibly prior exposure to AI technologies, as these variables may play a critical role in moderating the effectiveness of AI-driven learning interventions.

 

Comments 17: The ANCOVA and ANOVA methods require assumptions such as normality and homogeneity of variance. The authors should report tests of these assumptions and discuss whether the data met the criteria for valid parametric analysis.

Response 17: This concern had been previously raised by the same reviewer in an earlier comment, and we addressed it accordingly by reporting the tests for normality and homogeneity of variances in the revised version of the manuscript. We confirm that the data met the required assumptions for parametric analysis. The test results and interpretation are now clearly reported in the revised Results section, supporting the validity of the ANCOVA and ANOVA results.

 

 

Comments 18: The rationale behind the choice of the specific HMSAM variables and how they apply to learning to program in PHP is not fully articulated. The authors should explain how constructs like temporal dissociation or immersion specifically apply to this programming context.

Response 18: We thank the reviewer for highlighting the importance of clarifying how the HMSAM variables relate to programming education. This point is addressed in detail in the Theoretical Framework section, where we explain how each construct—such as enjoyment, curiosity, immersion, and temporal dissociation—relates to the specific cognitive and emotional demands of learning to program in PHP. For instance, temporal dissociation and immersion are frequently experienced during intense problem-solving tasks, such as debugging or designing logic structures, which are common in introductory PHP exercises. Control and perceived usefulness are critical when students interact with AI tools that offer real-time feedback. These clarifications were included to show the theoretical and practical relevance of HMSAM to the learning context explored in this study.

 

Comments 19: The claim that instructional videos are more effective may be overstated because there is no evidence about longer-term knowledge retention or continued skill development. The authors are encouraged to discuss potential limitations of short-term assessments.

Response 19: Thank you very much for reiterating this important observation. As you also noted in a previous comment, we recognize that our study's conclusions are based solely on immediate post-test results, which reflect short-term learning gains. In response to your earlier suggestion, we have explicitly addressed this limitation in the Discussion section of the revised manuscript. We acknowledge that, due to the lack of a delayed post-test, we cannot conclude long-term knowledge retention or sustained skill development. As suggested, we now emphasize the need for future studies to include longitudinal follow-up assessments to determine the stability and transferability of the observed effects over time.

 

Comments 20: There is little discussion of how the content in the instructional video group was structured and whether it was more straightforward or thorough than the guidance given by Copilot. The authors might compare the nature or structure of both interventions to determine whether this influenced the results.

Response 20: We thank you for this valuable comment regarding the potential differences in the structure and consistency of the instructional content between the video and Copilot groups. This issue is now explicitly addressed in the Discussion section of the manuscript. We acknowledge that the video intervention offered all participants pedagogically sequenced and uniform explanations. In contrast, the Copilot group relied on self-formulated prompts, resulting in varying depth and accuracy of the AI-generated responses. This lack of standardization in the AI condition may have influenced the differences in learning outcomes observed between the groups. As part of our analysis, we have discussed this point as a relevant limitation and proposed that future studies should control more rigorously for instructional consistency across conditions to ensure comparability.

 

Comments 21: The choice of PHP as the sole language in this study could limit generalization to other programming contexts. The authors should acknowledge that some of the findings might differ if the language or complexity level changed.

Response 21: We thank you for reiterating this important observation regarding the scope and generalizability of our findings. As noted in your earlier comment, we have addressed this point in the Discussion section of the revised manuscript. Specifically, we acknowledge that the exclusive use of PHP may limit the applicability of the results to other programming contexts, as different languages present unique syntactic and conceptual demands. We have also noted that future studies should explore the effects of similar interventions in other languages and levels of complexity to better assess the robustness of the findings across diverse learning environments.

 

Comments 22: The MS Copilot group was given follow-up questions, but the integration of these prompts into the overall learning process is not fully explained. The authors should clarify how these guided questions were used and how they might have influenced the outcomes.

Response 22: Thank you for your observation regarding the integration of follow-up questions in the MS Copilot group. As noted in your previous comments, we have clarified this point in the revised manuscript. In the Process section, we now explain that these guided prompts were embedded directly within each exercise and were designed to help students critically assess the output provided by MS Copilot. The questions encouraged learners to identify inconsistencies, compare AI-generated solutions with their own expectations, and consider alternative implementations. As discussed, these prompts promoted metacognitive engagement and reduced overreliance on AI, which may have influenced how students approached problem-solving during the intervention.


All changes are marked in yellow within the document.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

While the authors have addressed several of the previous round’s comments, the revised manuscript still falls short of the scientific and methodological standards required for publication. Given the number and severity of these concerns, I regret that I cannot recommend this manuscript for publication in its current form.

 

 

- The study claims to have used random group allocation via the Moodle platform, but no information is provided regarding allocation concealment, stratification, or verification of balance on critical baseline variables (e.g., gender, prior programming experience, digital literacy). The pre-test only controls for PHP knowledge.

 

- The entire study intervention spanned only seven days, with no follow-up or delayed post-test. This short duration limits any conclusions about retention, transfer, or sustained impact on learning outcomes.

 

- For the MS Copilot group, the number and nature of queries were not tracked. Without this data, it is difficult to assess the consistency or depth of engagement with the tool across participants.

 

- The pre- and post-tests were constructed from instructor-made exercises, with limited psychometric evidence. While post-test reliability (Cronbach’s α = 0.82) is reported, no reliability data for the pre-test are provided, and there is no mention of item analysis or construct validity.

 

- There is a risk that test items were too closely aligned with the instructional materials, especially in the Copilot condition, potentially inflating learning gains through familiarity.

 

- Although Levene’s test was used to assess homogeneity of variance, large differences in post-test standard deviations across groups (2.32 vs. 1.23) suggest unequal variance may still affect the ANCOVA results. No robust methods were used to account for this.

 

- Multiple ANOVA and ANCOVA tests were conducted across eight HMSAM constructs without proper correction for family-wise error (e.g., Bonferroni, FDR). Despite marginal p-values (e.g., p = 0.035), the manuscript discusses these as practically meaningful.

 

 

Minor Comments:

 

  1. Figures 3–5 are low-resolution screenshots and include unreadable code and UI elements; these should be professionally redrawn or replaced with transcribed code and annotations.

 

  2. Terminology inconsistency: The term “MS Copilot” is used throughout instead of “Microsoft Copilot” or its correct product version. Please ensure consistency with official branding.

 

  3. Language and Style: Several long, compound sentences (especially in the Discussion and Conclusion) reduce clarity. Tense inconsistencies should also be corrected (e.g., past tense for completed procedures).

Author Response

Comments 1: The study claims to have used random group allocation via the Moodle platform, but no information is provided regarding allocation concealment, stratification, or verification of balance on critical baseline variables (e.g., gender, prior programming experience, digital literacy). The pre-test only controls for PHP knowledge.

Response 1: We appreciate the reviewer's comments, which have significantly enriched the manuscript. In response to this observation, several enhancements were incorporated to address the noted limitations. Firstly, a detailed explanation was added to Section 3.1 regarding the random assignment process using Moodle's random group function, emphasizing its configuration to conceal group composition and minimize bias, thereby ensuring the integrity of the experimental design. Secondly, a new paragraph was included in Section 6 acknowledging the lack of control over confounding variables, such as digital literacy, prior programming experience, and gender, and proposing the inclusion of instruments to measure these variables in future studies, thus improving the robustness of the design. Finally, a paragraph in Section 5.1 was expanded to highlight the limitation of not recording the number or nature of interactions with MS Copilot, explaining its potential impact on learning outcomes and recommending interaction logs in future research. These revisions strengthen the transparency and methodological rigor of the study and directly address the concerns raised.

 

Comments 2: The entire study intervention spanned only seven days, with no follow-up or delayed post-test. This short duration limits any conclusions about retention, transfer, or sustained impact on learning outcomes.

Response 2: While we recognize that the short duration of the intervention limits our ability to draw conclusions about long-term retention or transfer, our study was designed to target well-defined, short-term learning objectives. According to Bloom (1968), and as later supported by Conklin (2005) and Guskey (2007), such objectives can be validly addressed through brief, focused instructional interventions. These are especially suitable for foundational cognitive levels, where mastery can be achieved through concise and structured activities. In response to your suggestion, we have incorporated a clarifying statement in the Discussion section to acknowledge this rationale and the need for future longitudinal studies to assess sustained learning outcomes.

 

Comments 3: For the MS Copilot group, the number and nature of queries were not tracked. Without this data, it is difficult to assess the consistency or depth of engagement with the tool across participants.

Response 3: We appreciate the reviewer’s thoughtful observation. Indeed, the absence of interaction logging represents a limitation in our study design. While we did not track the number or nature of individual prompts submitted to MS Copilot, all participants in the AI group followed a structured activity guide that included targeted programming tasks and follow-up questions intended to promote critical evaluation of the AI-generated responses. This guide was designed to scaffold metacognitive engagement and reduce variability in usage patterns. To address the reviewer’s concern, we have explicitly acknowledged this limitation at the end of Section 5.1 (“Learning effects”), emphasizing the lack of empirical usage data and recommending that future research include interaction tracking tools. In addition, we have uploaded the exercise guide used by the MS Copilot group as supplementary material, which can be accessed through the MDPI submission platform.

 

Comments 4: The pre- and post-tests were constructed from instructor-made exercises, with limited psychometric evidence. While post-test reliability (Cronbach’s α = 0.82) is reported, no reliability data for the pre-test are provided, and there is no mention of item analysis or construct validity.

Response 4: We thank the reviewer for this important observation. The pre- and post-tests and the exercise guide were developed from a well-established item bank that has been used consistently over the past four academic years (seven semesters) in the course. These items were created and refined by instructors with over five years of experience teaching PHP and have been applied as formative exercises and assessments across multiple cohorts. While formal psychometric validation was not originally conducted, the repeated application and iterative revision of these items provide empirical support for their reliability and their alignment with the course's learning objectives. To strengthen the manuscript, we have added information regarding the historical use and curricular alignment of the item bank, and we have acknowledged the limitation related to the lack of item-level psychometric analysis.
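For transparency, the reported internal-consistency estimate follows the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), and the same computation could be applied to the pre-test once item-level responses are recorded. The following minimal Python sketch illustrates the calculation; the score matrix and its dimensions are hypothetical placeholders, not data from this study.

import numpy as np

def cronbach_alpha(item_scores):
    # Cronbach's alpha for an (n_respondents, n_items) matrix of item scores
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical usage: 20 respondents answering 10 dichotomous items.
# Random, uncorrelated data are used only to demonstrate the call,
# so the resulting alpha will be close to zero.
rng = np.random.default_rng(0)
print(round(cronbach_alpha(rng.integers(0, 2, size=(20, 10))), 2))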

 

Comments 5: There is a risk that test items were too closely aligned with the instructional materials, especially in the Copilot condition, potentially inflating learning gains through familiarity.

Response 5: We appreciate the reviewer’s concern regarding the potential overlap between instructional materials and test items. However, we clarify that both groups (Copilot and instructional video) used the same practice guide, containing the same set of exercises. Likewise, the pre-test and post-test were constructed from the same item bank used historically in the course, independent of the delivery method. Therefore, any alignment between the assessment items and the instructional material was applied equally to both conditions. Importantly, the fact that the instructional video group outperformed the Copilot group, despite working with identical exercises and assessments, suggests that learning gains were not due to item familiarity but rather to differences in instructional effectiveness. We have added a clarification to Section 3.4 to address this concern explicitly.

 

Comments 6: Although Levene’s test was used to assess homogeneity of variance, large differences in post-test standard deviations across groups (2.32 vs. 1.23) suggest unequal variance may still affect the ANCOVA results. No robust methods were used to account for this.

Response 6: We thank the reviewer for this observation. Indeed, the post-test standard deviations for the Copilot and Video groups differed (2.32 vs. 1.23), which could raise concerns about variance homogeneity. However, in addition to Levene’s test (which was non-significant), we further examined the coefficient of variation (CV) for both groups. Given that the post-test means differ substantially (5.69 vs. 8.17), the CV provides a more appropriate measure of relative dispersion in this context. The analysis revealed that dispersion was not disproportionately different between groups when adjusted for the mean. Moreover, existing literature supports that ANCOVA remains robust to moderate violations of variance homogeneity, particularly in balanced designs such as ours (35 vs. 36 participants). To strengthen the discussion, we have acknowledged this limitation and proposed implementing simulation-based approaches in future studies to evaluate the robustness of our findings under different variance and distributional assumptions. We have cited foundational and recent studies that address ANCOVA's robustness in such conditions (e.g., Rheinheimer & Penfield, 2001; Poremba & Rowell, 2012). Thank you again for raising this important statistical consideration, which has allowed us to enhance the transparency and methodological rigor of the manuscript.
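For illustration, one accessible robustness check along these lines is to refit the ANCOVA with heteroscedasticity-consistent (HC3) standard errors, which do not assume equal residual variance across groups. The sketch below shows this check under the assumption that the data are available as a table with 'pretest', 'posttest', and 'group' columns; the column names and file name are illustrative and are not taken from the study's dataset.

import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Hypothetical data file with one row per participant
df = pd.read_csv("scores.csv")

# Levene's test on post-test scores, as reported in the manuscript
copilot = df.loc[df["group"] == "copilot", "posttest"]
video = df.loc[df["group"] == "video", "posttest"]
print(stats.levene(copilot, video))

# ANCOVA refit with HC3 robust standard errors for the group effect
model = smf.ols("posttest ~ pretest + C(group)", data=df).fit(cov_type="HC3")
print(model.summary())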

 

Comments 7: Multiple ANOVA and ANCOVA tests were conducted across eight HMSAM constructs without proper correction for family-wise error (e.g., Bonferroni, FDR). Despite marginal p-values (e.g., p = 0.035), the manuscript discusses these as practically meaningful.

Response 7: We thank the reviewer for highlighting this important methodological consideration. We acknowledge that no correction for multiple comparisons, such as the Bonferroni adjustment (which controls the family-wise error rate, FWER) or the Benjamini–Hochberg procedure (which controls the false discovery rate, FDR), was applied in the analysis of the HMSAM constructs. This is a limitation of the current study, which we now explicitly state in Section 4.2. Nonetheless, we emphasize that:

(1) no ANCOVA was applied in this section, since the HMSAM constructs were modeled directly as dependent variables without covariates;

(2) the observed p-values—while in some cases marginal—were accompanied by consistent medium effect sizes (e.g., η² = 0.06 for curiosity, η² = 0.08 for intention to use), suggesting that results are not merely statistical artifacts; and

(3) as noted in applied research, corrections for multiple comparisons often adjust significance thresholds without drastically altering the substantive interpretation of results, especially when effects are systematic and effect sizes are meaningful.

We have revised the manuscript accordingly to acknowledge this limitation and to recommend that future studies systematically implement p-value correction strategies to ensure more conservative inference. Thank you for this valuable suggestion that has allowed us to improve the methodological transparency of the study.
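As an illustration of the correction strategies recommended above, Bonferroni and Benjamini–Hochberg (FDR) adjustments can be applied to the full set of construct-level p-values in a few lines; the values in the sketch below are purely hypothetical and are not the study's results.

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for the eight HMSAM constructs (illustrative only)
p_values = [0.03, 0.01, 0.08, 0.21, 0.44, 0.05, 0.61, 0.09]

for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adjusted], list(reject))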

 

Comments 8: Figures 3–5 are low-resolution screenshots and include unreadable code and UI elements; these should be professionally redrawn or replaced with transcribed code and annotations.

Response 8: In response, Figures 3 and 4 have been redesigned and replaced with transcribed, editable code blocks directly embedded in the manuscript. These are now presented in English and formatted in a monospaced style to improve clarity, accessibility, and adherence to publication standards. Regarding Figure 5, we respectfully clarify that it is not an example of student interaction or AI output. Rather, it is a screenshot from the instructional video showing the teacher explaining the code to students. Since it does not include detailed code but serves a documentary purpose, we have decided to retain it in its current form. Nevertheless, we remain open to further adjustments should the editorial team require it.

 

Comments 9: Terminology inconsistency: The term “MS Copilot” is used throughout instead of “Microsoft Copilot” or its correct product version. Please ensure consistency with official branding.

Response 9: We thank the reviewer for noting this important detail. In response, we have carefully revised the manuscript and replaced all “MS Copilot” instances with the correct and official term “Microsoft Copilot” to ensure consistency with the product’s branding. A total of 59 cases were updated accordingly throughout the text.

 

Comments 10: Language and Style: Several long, compound sentences (especially in the Discussion and Conclusion) reduce clarity. Tense inconsistencies should also be corrected (e.g., past tense for completed procedures).

Response 10: We have conducted an editorial revision of the manuscript, with special attention to the Discussion and Conclusion sections.
