Article
Peer-Review Record

A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

Appl. Sci. 2025, 15(18), 10055; https://doi.org/10.3390/app151810055
by Andrija Bernik 1,*, Danijel Radošević 2 and Andrej Čep 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 August 2025 / Revised: 8 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors investigate the application of artificial intelligence (AI) tools for the preliminary assessment of undergraduate programming assignments.

A multi-phase experimental study was conducted across three computer science courses: Introduction to Programming, Programming 2, and Advanced Programming Concepts.

 A total of 315 Python assignments were collected from the Moodle learning management system, with 100 randomly selected submissions analyzed in detail. AI evaluation was performed using ChatGPT-4 (GPT-4-turbo), Claude 3, and Gemini 1.5 Pro models, employing structured prompts aligned with a predefined rubric that assessed functionality, code structure, documentation, and efficiency.
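For illustration, a rubric-aligned grading prompt of the kind described could be assembled as in the minimal sketch below; the rubric weights, the prompt wording, and the build_grading_prompt helper are assumptions for this example, not the authors' actual prompt or evaluation code.

```python
# Minimal sketch of a rubric-aligned grading prompt (hypothetical wording and
# weights; the authors' actual prompt and evaluation pipeline are not reproduced here).

RUBRIC = {
    "functionality": 4,   # does the program meet the task requirements?
    "code_structure": 3,  # decomposition, naming, readability
    "documentation": 2,   # comments and docstrings
    "efficiency": 1,      # appropriate algorithms and data structures
}

def build_grading_prompt(task_description: str, student_code: str) -> str:
    """Assemble a structured prompt asking the model to score each rubric criterion."""
    criteria = "\n".join(f"- {name} (max {points} points)" for name, points in RUBRIC.items())
    return (
        "You are grading a student's Python assignment.\n\n"
        f"Task description:\n{task_description}\n\n"
        f"Student submission:\n{student_code}\n\n"
        "Score the submission on the following criteria and briefly justify each score:\n"
        f"{criteria}\n"
        "Return JSON with per-criterion scores, a total out of 10, and two or three "
        "sentences of actionable feedback."
    )

# Example usage: the resulting string would then be submitted to ChatGPT-4,
# Claude 3, or Gemini 1.5 Pro through the respective interface or API.
print(build_grading_prompt(
    "Read a list of integers and print their average.",
    "nums = [int(x) for x in input().split()]\nprint(sum(nums) / len(nums))",
))
```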

 Quantitative results demonstrate high correlation between AI-generated scores and instructor evaluations, with ChatGPT-4 achieving the highest consistency (Pearson coefficient 0.91) and the lowest average absolute deviation (0.68 points).
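These agreement metrics are straightforward to reproduce. A minimal sketch on placeholder scores (invented here for illustration, not the study's data) might look like this:

```python
# Sketch of the agreement metrics discussed above, computed on placeholder data.
import numpy as np
from scipy.stats import pearsonr

instructor = np.array([8, 6, 9, 7, 10, 5, 8, 9])  # hypothetical instructor grades (0-10)
ai_model   = np.array([8, 7, 9, 6, 10, 5, 7, 9])  # hypothetical AI grades for the same work

r, p_value = pearsonr(instructor, ai_model)              # Pearson correlation coefficient
mad = np.mean(np.abs(instructor - ai_model))             # mean absolute difference in points
within_one = np.mean(np.abs(instructor - ai_model) <= 1) * 100  # share within +/-1 point

print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
print(f"Mean absolute difference = {mad:.2f} points")
print(f"Within +/-1 point: {within_one:.0f}%")
```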

Qualitative analysis highlights AI’s ability to provide structured, actionable feedback, though variability across models was observed. The study identifies benefits such as faster evaluation and enhanced feedback quality, alongside challenges including model limitations, potential biases, and the need for human oversight. Recommendations emphasize hybrid evaluation approaches combining AI automation with instructor supervision, ethical guidelines, and integration of AI tools into learning management systems.

 The findings indicate that AI-assisted grading can improve efficiency and pedagogical outcomes while maintaining academic integrity.

The study is interesting but in its current form requires several improvements before being considered for acceptance.

Comments:

1) Include a line or two of background in the abstract.

2) All supporting scientific literature is included in section 2 (related work). Include some work in the introduction to support the assertions.

3) The methods comprise four phases that should be described in greater detail and introduced with a summary first; avoid short paragraphs, use standard tables, and describe the statistics and the other traditional methodological components that appear to be absent.

4) The results are currently incomprehensible. They read as a bare list of the contents the authors intend to develop (see Section 4.1, "Quantitative Analysis of Grades": "Correlation between AI and instructor (Pearson coefficient): ChatGPT-4: 0.91; Gemini 1.5: 0.88; Claude 3: 0.85. Average absolute score difference (scale 0–10): ChatGPT-4: 0.68 points; Gemini 1.5: 0.79 points; Claude 3: 0.94 points. Percentage of AI grades within ±1 point of instructor grade: ChatGPT-4: 86%; Gemini 1.5: 81%; Claude 3: 75%. Evaluation time for 100 assignments (batch processing): ChatGPT-4 (via Code Interpreter): 11 minutes; Gemini 1.5 (via AI Studio API): 8 minutes; Claude 3 (manual): approx. 90 minutes").

5) The discussion is missing; not a single comparison reference is provided.

 

Essentially, there is merit, but the current format doesn't allow for an accurate evaluation. Rewrite it as I suggest, and I'll be happy to reread it.


Author Response

Include a line or two of background in the abstract → Added two new opening sentences describing the background of the research.

Add supporting literature in the introduction → Added background sentences and references in Introduction (2nd paragraph); expanded Section 2 (Related Work).

Methods: more detail, summary first, tables, statistics → Added overview of 4 phases in Section 3 (Methodology, 1st paragraph); more detail on statistics in 3.3 AI-Based Evaluation; used Tables 1 and 2.

Results incomprehensible (just a list) → Rewritten in Section 4.1, added Table 3 and explanatory narrative.

Discussion missing, no comparisons → Added comparative paragraph with references (Mehta, Jukiewicz, Aytutuldu) in Section 5 (Discussion).

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

The manuscript "A Comparative Study of Large Language Models in Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading" investigates the application of large language models (ChatGPT-4, Claude 3, Gemini 1.5) for automated assessment of student programming assignments. The study makes a significant contribution to understanding the practical application of AI in education and may serve as a foundation for developing standardized approaches to automated assessment in higher education institutions.

Nevertheless, the manuscript in its present form exhibits certain areas that require attention to strengthen its scholarly rigor, conceptual clarity, and applicability to educational practice. Addressing the observations and recommendations outlined below could enhance the work’s clarity, scientific value, and practical significance.

  1. The manuscript title warrants reconsideration, as the current formulation suggests broader coverage of AI educational applications. Since the work focuses exclusively on programming assignment assessment, a more specific title would better reflect its actual scope and content.
  2. Including statistical significance tests for comparing mean scores between models, along with inter-rater reliability analysis, would strengthen the empirical foundation of the study and enable more substantiated conclusions regarding the advantages of individual models.
  3. Adding the data parsing and validation script (line 296), for example, in appendices, could ensure reproducibility of the research results.
  4. I recommend supplementing the work with a more detailed analysis of ethical aspects, particularly issues of student data privacy and impact on academic integrity. This could enhance the practical value of the research findings for educational institutions considering the implementation of such systems.
  5. The study results should be compared in the section "Discussion and Recommendations" with findings from other authors who have investigated the use of AI for evaluating student assignments. This would help better understand how the obtained data aligns with general trends in this field.
  6. The assertion that AI provides “high-quality feedback” (line 419) in student code assessment requires critical reconsideration. Automated assessment has several significant limitations. For instance, it cannot adequately evaluate solution creativity, architectural approach selection, contextual algorithm appropriateness, and originality of thinking. These competencies are fundamental in programming. Moreover, AI assessment may contain algorithmic biases and errors. Additionally, traditional instructor assessment is not free from subjectivity. Contemporary students actively use AI tools for code writing, raising questions about the authenticity of their work and the validity of the assessment process itself.
  7. The conclusions are too general and do not fully reflect the obtained results. It is recommended to make the findings more specific, emphasizing the quantitative performance indicators and practical aspects of applying different AI models in the educational process.

Best regards,

Reviewer

Author Response

The manuscript title warrants reconsideration, as the current formulation suggests broader coverage of AI educational applications. Since the work focuses exclusively on programming assignment assessment, a more specific title would better reflect its actual scope and content.

→ Title revised to “…in Programming Education” (Title page).

Including statistical significance tests for comparing mean scores between models, along with inter-rater reliability analysis, would strengthen the empirical foundation of the study and enable more substantiated conclusions regarding the advantages of individual models.

→ Correlations and score differences are reported in Section 4.1.
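For illustration only, the kind of significance testing and inter-rater reliability analysis the reviewer requests could be sketched as follows on placeholder grades; the Wilcoxon signed-rank test and weighted Cohen's kappa shown here are one possible choice, not the analysis reported in the manuscript.

```python
# Illustrative sketch of significance and inter-rater agreement tests on placeholder
# scores; the manuscript itself reports correlations and score differences instead.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

instructor = np.array([8, 6, 9, 7, 10, 5, 8, 9])  # hypothetical instructor grades
chatgpt4   = np.array([8, 7, 9, 6, 10, 5, 7, 9])  # hypothetical ChatGPT-4 grades
gemini15   = np.array([7, 6, 9, 7,  9, 6, 8, 8])  # hypothetical Gemini 1.5 grades

# Paired non-parametric test: do two models assign systematically different scores?
stat, p = wilcoxon(chatgpt4, gemini15)
print(f"Wilcoxon signed-rank: statistic = {stat:.1f}, p = {p:.3f}")

# Inter-rater reliability between instructor and model, treating the integer
# grades as ordered categories (quadratically weighted kappa).
kappa = cohen_kappa_score(instructor, chatgpt4, weights="quadratic")
print(f"Weighted Cohen's kappa (instructor vs ChatGPT-4): {kappa:.2f}")
```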

Adding the data parsing and validation script (line 296), for example, in appendices, could ensure reproducibility of the research results.

→ Described in Section 3.1 and mentioned in the Data Availability statement.
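Since the script is described rather than reproduced, a minimal sketch of what parsing and syntax validation of downloaded Moodle submissions might involve is given below; the directory layout and record fields are assumptions, not the authors' actual script.

```python
# Hypothetical sketch of parsing and validating downloaded Moodle submissions;
# the authors' actual script is described in Section 3.1 but not reproduced here.
import ast
from pathlib import Path

def validate_submissions(root: Path) -> list[dict]:
    """Collect each .py submission and check that it at least parses as valid Python."""
    records = []
    for path in sorted(root.rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            ast.parse(source)
            status = "ok"
        except SyntaxError as exc:
            status = f"syntax error at line {exc.lineno}"
        records.append({"file": str(path), "lines": source.count("\n") + 1, "status": status})
    return records

if __name__ == "__main__":
    for record in validate_submissions(Path("moodle_submissions")):
        print(record)
```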

I recommend supplementing the work with a more detailed analysis of ethical aspects, particularly issues of student data privacy and impact on academic integrity. This could enhance the practical value of the research findings for educational institutions considering the implementation of such systems.

→ New ethics paragraph in Section 5 (Discussion).

The study results should be compared in the section "Discussion and Recommendations" with findings from other authors who have investigated the use of AI for evaluating student assignments. This would help better understand how the obtained data aligns with general trends in this field.

→ Comparative discussion added in Section 5 (Discussion).

The assertion that AI provides “high-quality feedback” (line 419) in student code assessment requires critical reconsideration. Automated assessment has several significant limitations. For instance, it cannot adequately evaluate solution creativity, architectural approach selection, contextual algorithm appropriateness, and originality of thinking. These competencies are fundamental in programming. Moreover, AI assessment may contain algorithmic biases and errors. Additionally, traditional instructor assessment is not free from subjectivity. Contemporary students actively use AI tools for code writing, raising questions about the authenticity of their work and the validity of the assessment process itself.

→ Limitations added (creativity, originality, bias, authenticity issues) in Section 5 (Discussion, last paragraph).

The conclusions are too general and do not fully reflect the obtained results. It is recommended to make the findings more specific, emphasizing the quantitative performance indicators and practical aspects of applying different AI models in the educational process.

→ Conclusions revised: model-specific results highlighted and practical recommendations added in Section 6 (Conclusions).

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

NA

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for your diligent work on improving the article. You have addressed all previous comments and made important revisions to the text. These changes have significantly improved the clarity of the presentation, made the main research findings easier to follow, and made your scientific conclusions more convincing and well-founded. The article is now fully ready for publication and meets all MDPI standards.

Best regards,

Reviewer
