A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

Bernik, Andrija; Radošević, Danijel; Čep, Andrej

doi:10.3390/app151810055

Open AccessArticle

A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

by

Andrija Bernik

^1,*

,

Danijel Radošević

² and

Andrej Čep

³

¹

Department of Multimedia, University North, 42000 Varaždin, Croatia

²

Faculty of Organization and Informatics, University of Zagreb, 42000 Varaždin, Croatia

³

Inpro d.o.o., Department for Systems Implementation, 40000 Čakovec, Croatia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10055; https://doi.org/10.3390/app151810055

Submission received: 19 August 2025 / Revised: 8 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025

(This article belongs to the Special Issue Emerging Trends in Artificial Intelligence and Computer Science for E-Learning)

Download Review Reports Versions Notes

Abstract

Programming education traditionally requires extensive manual assessment of student assignments, which is both time-consuming and resource-intensive for instructors. Recent advances in large language models (LLMs) open opportunities for automating this process and providing timely feedback. This paper investigates the application of artificial intelligence (AI) tools for preliminary assessment of undergraduate programming assignments. A multi-phase experimental study was conducted across three computer science courses: Introduction to Programming, Programming 2, and Advanced Programming Concepts. A total of 315 Python assignments were collected from the Moodle learning management system, with 100 randomly selected submissions analyzed in detail. AI evaluation was performed using ChatGPT-4 (GPT-4-turbo), Claude 3, and Gemini 1.5 Pro models, employing structured prompts aligned with a predefined rubric that assessed functionality, code structure, documentation, and efficiency. Quantitative results demonstrate high correlation between AI-generated scores and instructor evaluations, with ChatGPT-4 achieving the highest consistency (Pearson coefficient 0.91) and the lowest average absolute deviation (0.68 points). Qualitative analysis highlights AI’s ability to provide structured, actionable feedback, though variability across models was observed. The study identifies benefits such as faster evaluation and enhanced feedback quality, alongside challenges including model limitations, potential biases, and the need for human oversight. Recommendations emphasize hybrid evaluation approaches combining AI automation with instructor supervision, ethical guidelines, and integration of AI tools into learning management systems. The findings indicate that AI-assisted grading can improve efficiency and pedagogical outcomes while maintaining academic integrity.

Keywords:

AI-assisted grading; programming education; large language models; ChatGPT; automated feedback; educational technology

1. Introduction

Artificial Intelligence (AI) is increasingly applied in higher education, particularly in STEM disciplines, where automation and analytical support can substantially reduce instructional workload and enhance student learning experiences. In programming courses, the assessment of student assignments traditionally requires considerable time, expert code evaluation, and additional communication with students for clarification or corrections. With the growing availability of advanced generative models and tools for text and code processing, the opportunity emerges for AI-assisted preliminary grading, enabling real-time analysis, feedback, and classification of student work. In higher education, especially with growing class sizes, automated grading systems are increasingly seen as necessary to handle large volumes of assignments and reduce the burdens on instructors. At the same time, educators recognize the potential of AI to enhance assessment processes, suggesting that instead of banning such tools, they should be leveraged to improve grading consistency and feedback quality.

This study presents an experimental approach to employing multiple AI tools and platforms for the preliminary evaluation of programming assignments in undergraduate courses Introduction to Programming, Programming 2, and Advanced Programming Concepts. The experiment combines generative large language models (LLMs), including ChatGPT with analytical extensions (DeepResearch, Code Interpreter/Data Analyst), Claude.ai, Google AI Studio with Gemini models, as well as custom Python scripts for code evaluation. Special attention is devoted to the integration of Google AI Studio, which supports contextually extensive queries (up to 1,000,000 tokens) and connectivity with tabular data (e.g., Excel), thereby enabling additional flexibility in grading and analysis.

For the purposes of the experiment, a collection of Python-based assignments completed by students during regular coursework was compiled. The assignments were exported from the Moodle system in CSV/JSON format and subsequently processed by AI systems. Evaluation was conducted in three phases: (1) code analysis by LLMs according to predefined criteria (functionality, correctness, style, comments), (2) comparison with instructor-assigned grades, and (3) quantitative and qualitative analysis of deviations and grading accuracy.

The scientific contribution of this study is threefold. First, it introduces a novel methodology for semi-automated assessment of student work through AI systems integrated with both open-source and commercial tools. Second, it demonstrates the application of fine-tuned and contextually enriched AI models in an authentic educational environment, including prototype scripts for code processing and evaluation. Third, it examines challenges and limitations, such as the reliability of LMS data exports, technical constraints of AI tools (API limitations, quotas, privacy concerns), and the necessity of manual validation.

Additionally, the experiment collected data on the effectiveness of detecting common coding errors (e.g., faulty loops, uninitialized variables, inefficient algorithms), the processing speed per assignment, and the overall time savings compared to manual grading. Particular emphasis is placed on tools that support zero-shot and few-shot learning, which allow for the generation of grades and feedback without extensive manual data labeling.

The remainder of the paper details the employed technologies, system architecture, experimental methodology, evaluation criteria, results, and their interpretation. Furthermore, the study discusses scalability to other courses and institutions, as well as the ethical and transparency considerations associated with the educational use of AI systems.

2. Related Work and Review of Existing AI Tools in Education

Recent studies have shown that LLMs can approximate human-level grading accuracy in essay scoring. C. Impey et al. [1] in their study, used GPT-4 to grade short science essays in MOOCs, finding that its scores matched instructor grades and even exceeded the reliability of peer grading. This suggests LLMs may enable automated yet credible scoring at scale. Similarly, M. Usher [2] compared ChatGPT-4 with professors and peers in a university course assessment. In his study, M. Usher reported that ChatGPT-4 tended to grade more leniently than human assessors but offered more extensive feedback, while peer and instructor scores were lower and closely aligned. On the other hand, C. Grévisse [3] found GPT-4′s grading to be significantly stricter than a human teacher’s, giving lower scores on average. Nonetheless, GPT-4 showed high precision in identifying fully correct answers with few false positives. The study also tested Gemini 1.5, another LLM, which on average graded similarly to the human but exhibited greater variability on certain questions. [3]. Additionally, B. Quah et al. [4] evaluated ChatGPT in grading dental students’ essays. They reported a moderate correlation with human scores and noted that ChatGPT’s average grades were slightly lower than instructors.

One of the central themes of published studies is whether AI graders can reliably mimic human judgment and what biases or differences exist. M. Lundgren in his study [5] found that GPT-4 matched mean scores of human graders but lacked sensitivity to rubric nuances, thus avoiding very high or low scores and had low inter-rater reliability with humans. R. Mok et al. [6] conducted a “hands-on test” using LLMs to grade university physics exam answers. In their study they demonstrated that LLMs graded straightforward physics answers well, but struggled with diagrams and complex reasoning. C. Xiao et al. [7] proposed a dual-process system combining LLMs and humans, through a human-AI collaborative scheme in which an open-source LLM grader is used alongside human graders in a second-language essay course. Their system provided automated score and high quality feedback, while greatly improving grading consistency, when novice and expert instructors co-graded with the AI. Such findings correlate with the findings of M. Usher [2] study and conclusion that AI can be integrated as a “second pair of eyes” thus improving grading consistency.

In programming education, LLMs like ChatGPT show promise in evaluating code quality and offering feedback. Traditional auto-graders mostly check functional correctness, but LLM-based approaches aim to evaluate code qualitatively, by reviewing style, efficiency and documentation while also providing constructive feedback. E. Q. Tseng et al. [8] introduced CodEv, an automated code grading framework that uses multiple LLMs with chain-of-thought prompting to assess code on multiple dimensions. Codev provided human-aligned code reviews and feedback by not only generating grades, but also providing textual comments on code readability, structure, and maintainability. Additionally, A. Mehta et al. [9] in their study specifically evaluated ChatGPT as a virtual teaching assistant in an introductory programming course. They compared ChatGPT’s grading of student code against human educators, and also analyzed the feedback it gave. Their findings show that ChatGPT was able to apply the grading rubric correctly in most cases, and its point deductions correlated strongly with the educators’, thus indicating that it identified similar errors or omissions. Moreover, M. Jukiewicz’s study [10] reinforces these previously mentioned findings. In his study, over a 15-week pilot study that incorporated nine programming assignments were dual-graded by the instructor and ChatGPT. On average, the human instructor tended to assign slightly higher scores than ChatGPT, but there was a strong positive correlation between their grades, demonstrating consistent ranking of student performance by the AI. I. Aytutuldu et al. [11] utilizied an AI-powered assessment tool AI-PAT, which leverages ChatGPT and Gemini models to both grade and handle grading appeals in computer science courses. In their study they found the AI models’ scores to be strongly correlated with each other and with human judgments, though different prompt setups could cause score variability. On the other hand, A. Mangalur et al. [12], in their study, proposed a framework combining adaptive intelligence and analytics for continuous student evaluation. Their system uses a generative AI, utilizing Gemini API, to generate customized quiz questions, auto-grade responses, and instantly feedback with explanations.

Beyond grading, a critical role of assessment is to give students feedback that helps them improve. Here, researchers have explored AI as a feedback generator and compared it with traditional feedback sources, such as teacher or peers. A recurring finding is that AI feedback can be as effective as human feedback in supporting learning outcomes, and students often appreciate its clarity and detail, but the best results may come from combining AI and human input. For instance, in their research J. Escalante et al. [13] studied English learners, primarily EFL students, receiving writing feedback either from an instructor or from ChatGPT. In their study, they found out that after an 8-week writing intervention, both groups improved similarly in their writing proficiency, with no significant difference in gains between AI-generated feedback and human feedback. Similarly, S.K. Banihashem et al. in their study [14] compared peer-generated versus ChatGPT-generated feedback on university students’ argumentative essays. They found a significant difference in the nature of feedback, ChatGPT’s comments were more descriptive, focusing on how the essay was written, primary focus on organization, style, and coherence, whereas peer feedback was more problem-focused, identifying issues or gaps in arguments. B. Quah et al. [4] in their study concluded that the level of detail and personalization possible with LLM’s is their major advantage. In their research they observed that ChatGPT could deliver individualized, criterion-based feedback to every student essay, something infeasible for a single instructor in a large audience.

The advent of ChatGPT and similar AI has sparked a reevaluation of assessment strategies in higher education. Researchers are not only testing AI’s performance, but also discussing policy and implementation issues that arise when adopting AI for grading. In their study, I. S. Chaudhry et al. [15] argued for redesigning assessments in the ChatGPT era to emphasize skills beyond AI replication, while also recommending that universities redesign evaluations to emphasize skills that AI cannot simply replicate. Moreover, R. Kumar [16] in his study outlined the benefits and ethical concerns of using AI for grading, including issues of transparency and privacy, while also proposing numerous benefits of AI grading, such as discretion, consistent feedback, and elimination of human grader bias or fatigue. G. Ilieva et al. [17] proposed a framework for AI-driven assessment aligned with pedagogical goals and quality assurance. In their study, they emphasized iterative validation and faculty involvement where key point is that faculty and students must be engaged in co-designing AI-integrated assessment, to ensure buy-in and to surface any issues of trust or ethics early. Finally, in their study S. Elsayed and D. Cakir [18] argued for rethinking feedback practices, suggesting AI can support more structured and actionable feedback, while also providing evidence that many students still do not find traditional feedback useful, often because it’s too vague, too late, or not actionable. Similarly, A. Smolansky et al. [19], in their study, surveyed educators and students across universities regarding generative AI in assessment. They found out that both groups agreed that assessment practices must evolve; rather than banning AI, it should be leveraged. In practical terms, they concluded that this might mean using AI to grade routine aspects, such as grammar and factual content, while instructors grade creativity or reasoning, a partition that preserves human judgment where it’s most needed.

A. Čep et al. [20] employed a structured review methodology to analyze the integration of large language models (LLMs), particularly GPT models, in video game narrative design. Using clearly defined selection and analysis criteria, they systematically categorized recent experimental and case studies to evaluate the effectiveness of GPT models across various narrative elements, from dialogues to full storylines. Their work contributes scientifically by synthesizing state-of-the-art applications of LLMs in interactive storytelling and highlighting actionable insights for future integration across diverse game genres. Additionally, by assessing the benefits and limitations of GPT-driven narratives, the study provides guidance for both game designers and AI researchers in optimizing player experience through generative AI.

2.1. General Web-Based AI Tools

Within the scope of preliminary assessment of student programming assignments, several general-purpose AI tools with web-based interfaces were analyzed. These platforms require no additional installation and allow fast, accessible interaction through a standard web browser. The tools tested included ChatGPT 4.5 (OpenAI), Claude 3.5 (Anthropic), Perplexity Pro, DeepSeek R1, and Minimax ver. 01. All of these are based on advanced large language models (LLMs), differing in architecture, context length, code handling capabilities, and their ability to perform complex analytical operations.

ChatGPT proved to be the most effective tool for this project, primarily due to its integrated extensions such as DeepResearch and Data Analyst (also known as Code Interpreter). These allow CSV file uploads, data analysis, and complex tabular operations within an interactive interface. In the educational context, ChatGPT was employed for automated review and refinement of instructional materials, including evaluation of task clarity, logical consistency of instructions, and linguistic and grammatical accuracy. Through several iterations, assignment descriptions were refined into content that was pedagogically consistent, cognitively aligned with course level, and linguistically optimized for students whose first language is not English.

Moreover, the Data Analyst functionality enabled the loading and analysis of CSV files containing student results throughout the semester. This facilitated the generation of dynamic reports, including identification of high-performing students (potential candidates for exemption from partial exams), detection of anomalies (sudden changes in performance), and calculation of descriptive statistics (mean, median, standard deviation per assignment). This significantly accelerated and objectified pedagogical decision-making during the semester.

Claude.ai, based on Anthropic’s proprietary LLM architecture (Claude 3.0 series), was used when processing longer inputs (over 100k tokens), such as entire repositories of student code with comments. Its strength lies in maintaining consistency across the evaluation of multiple assignments within a single conversational context. Perplexity proved useful for quickly generating contextual summaries and comparing student solutions against best practices, while DeepSeek was tested for its declared proficiency with multiple programming languages and technically precise responses in C/C++ 20, Python 3.10, and JavaScript ECMAScript 2024 (ES2024). Minimax, though limited in availability and functionality, was used experimentally as a contrast to the other tools, particularly in analyzing commented code.

Despite their advanced capabilities, these tools still require manual validation of outputs, as errors were observed in interpreting complex semantic relationships within code, as well as cases of oversimplification. Additional challenges include limitations of free versions, usage quotas, and concerns regarding privacy and protection of student data when processed on commercial platforms.

In conclusion, web-based AI tools demonstrate strong potential for supporting instructors in material preparation, results analysis, and preliminary assessment. However, their integration into educational practice requires a carefully designed framework, including clear evaluation guidelines, adherence to ethical standards, and continuous quality monitoring.

2.2. Limitations of Web-Based Tools

Unlike generic web-based AI tools, which are primarily optimized for direct user interaction, advanced applications of AI in education require deeper integration of models within automated scripts and data-processing systems. In this experiment, Google AI Studio (Gemini 2.5 Flash model) was employed as a development environment enabling direct interaction with various Gemini model versions (1.5, 2.0, and 2.5). It supports query execution via Python scripts, processing of text and code inputs, and connectivity with external data sources such as Google Sheets or Excel files.

A key advantage of Gemini models is their support for extremely large context windows—up to 1,000,000 tokens in some cases—which allows simultaneous analysis of multiple student submissions, accompanying task descriptions, LMS-exported CSV files, and grading rubrics. This reduces the need to fragment input data, thereby mitigating the risk of losing semantic context.

Python scripts were designed to automate the process of querying Gemini models through the API, leveraging functions for:

importing student submissions from CSV format (Moodle exports),
structural and semantic code analysis (error detection, comments, code organization),
classification of solutions according to predefined rubrics (e.g., 0–5 points per dimension: correctness, efficiency, readability, comments),
generating student feedback in the form of textual recommendations for improvement.

Each query incorporated prompt engineering techniques, augmenting the input prompt with examples of correct and incorrect solutions (few-shot learning) and specifying the expected output format, such as a JSON structure with fields for grade, comment, and detected errors.

Beyond classification, Gemini models were also used for error pattern detection, e.g., recurring issues in the use of for and while loops, incorrect variable initialization, or misconceptions regarding function execution order. Using the models’ “explanation” modules, textual justifications were generated for problematic code segments, which holds pedagogical value in providing actionable feedback.

Additionally, an experimental batch-assessment workflow was tested, in which Gemini models graded large sets of assignments using predefined scripts that imported tasks, generated responses, and recorded scores into Excel files. This approach allowed for high-volume evaluation (hundreds of submissions per run) and facilitated comparison with instructor grading for accuracy validation.

Challenges observed included:

access limitations and usage quotas (especially with free Google accounts),
response latency for large inputs, where processing could take several minutes per assignment,
unreliability of LMS exports, which often produced semi-structured or inconsistent CSV files.

Despite these challenges, the advantages are significant. AI tools integrated into automated scripts enable higher levels of automation, reproducibility, and flexibility in assessment, while also opening possibilities for fine-tuning with real student data and grades as training sets in future iterations. Importantly, the ability to generate personalized feedback enhances transparency and student motivation, while substantially reducing instructor workload.

3. Methodology

The experiment was structured into four main phases: (1) Data preparation and export—collecting and anonymizing student code submissions from the LMS; (2) Definition of evaluation criteria—developing a grading rubric (functionality, structure, documentation, efficiency) and example-guided prompts; (3) AI-based evaluation and result recording—using the three LLMs to grade assignments according to the rubric and storing their scores and feedback; and (4) Comparative analysis—comparing the AI-generated grades to instructor-assigned grades for agreement and accuracy.

To examine the effectiveness and accuracy of AI-assisted preliminary grading of student programming assignments, a multi-phase experimental study was conducted across three undergraduate computer science courses: Introduction to Programming, Programming 2, and Advanced Programming Concepts. The experiment was structured into four main phases: (1) data preparation and export, (2) definition of evaluation criteria, (3) AI-based evaluation and result recording, and (4) comparative analysis with instructor grading.

3.1. Data Collection and Processing

The dataset of student assignments was collected from the Moodle learning management system at the University North. The assignments were written in Python and submitted either as plain text or .py files. For experimental purposes, the data were exported into CSV and JSON formats containing the following attributes:

anonymized student ID; solution source code; task identifier; instructor-assigned grade; instructor feedback (where available); submission timestamp.

To ensure consistency and structural integrity, a parsing and validation script was applied. This removed incomplete records and normalized code indentation, preventing syntactic irregularities from affecting AI models.

3.2. Definition of Evaluation Criteria

Student solutions were graded according to four rubric dimensions from Table 1.

The maximum score per assignment was 10 points. AI models were instructed via prompt to generate both a score and justification for each category, with the final grade obtained as the sum of components.

3.3. AI-Based Evaluation

Two groups of tools were employed for code analysis:

General web-based AI tools (ChatGPT 4-turbo, Claude, Perplexity) in interactive mode, with inputs manually submitted for validation and comparison of responses.
Python scripts integrated with Google AI Studio and Gemini models, enabling batch processing of assignments and structured recording of results.

The alignment between AI-generated scores and instructor scores was measured using Pearson correlation coefficients. Pearson’s r captures the linear relationship between the AI’s scores and the instructor’s scores. We also computed the absolute difference in points (on a 0–10 scale) for each assignment to quantify the magnitude of score discrepancies.

For each assignment, the prompt included:

task description (from Moodle); student source code; rubric definition; few-shot grading examples.

To ensure consistency, a subset of 100 assignments was cross-validated against instructor grading. For each solution, the following measures were examined:

absolute score difference (AI vs. instructor); correlation of grades (Pearson and Spearman coefficients); agreement on functional correctness (validated through test cases where applicable).

3.4. Result Analysis and Validation

Evaluation metrics were divided into quantitative and qualitative categories as shown in Table 2.

Additionally, task complexity was analyzed as a moderating factor in AI performance. Results indicated that models achieved higher accuracy in grading simpler algorithmic tasks (e.g., loops, conditionals), but encountered greater difficulty with assignments involving multiple files, modularization, or exception handling.

4. Case Study: Preliminary Grading of Student Assignments

A total of 315 student assignments were processed across three undergraduate courses, with 100 randomly selected assignments subjected to detailed analysis and comparison with instructor grades. Grading was performed using ChatGPT-4 (GPT-4-turbo), Claude 3 Sonnet, and Gemini 1.5 Pro models, applying serial prompts and API access where available. Results are presented through quantitative and qualitative analyses.

4.1. Quantitative Analysis of Grades

Correlation between AI and instructor (Pearson coefficient):

ChatGPT-4: 0.91
Gemini 1.5: 0.88
Claude 3: 0.85

Average absolute score difference (scale 0–10):

ChatGPT-4: 0.68 points
Gemini 1.5: 0.79 points
Claude 3: 0.94 points

Percentage of AI grades within ±1 point of instructor grade:

ChatGPT-4: 86%
Gemini 1.5: 81%
Claude 3: 75%

Evaluation time for 100 assignments (batch processing):

ChatGPT-4 (via Code Interpreter): 11 min
Gemini 1.5 (via AI Studio API): 8 min
Claude 3 (manual): approx. 90 min

The results (Table 3) indicate a high level of consistency and accuracy for the GPT-4 model compared with the others, with a relatively small average deviation and a high proportion of grades within ±1 point.

As shown in Table 3, ChatGPT-4 achieved the highest correlation and lowest error in scores compared to the instructor, with r = 0.91 and an average deviation of only 0.68 points. It also had 86% of its grades within ±1 point of the instructor’s grade, indicating close alignment. Gemini 1.5 was similarly accurate (r = 0.88) and notably the fastest, processing 100 assignments in about 8 min. Claude 3’s performance was a bit more variable (lower correlation and about 75% of its grades within ±1 point), and it required substantially more time when used without batch automation.

4.2. Quality of Feedback

All models generated textual justifications for the assigned grades. ChatGPT and Gemini, in particular, demonstrated:

Structured comments according to rubric dimensions (functionality, style, comments, efficiency),
Clear identification of errors (e.g., “the function returns a result but ignores user input”),
Suggestions for corrections in the form of pseudocode or alternative solutions.

Claude was less consistent: although in some cases it provided highly detailed insights, it often diverged from the rubric and produced subjective or overly general comments (e.g., “the code looks fine” without further elaboration).

5. Discussion and Recommendations

The implementation of AI tools in preliminary grading of student assignments offers several advantages, but also presents challenges that must be carefully addressed. One of the most significant benefits is a considerable acceleration of the evaluation process, enabling instructors to focus on higher-quality pedagogical work and individualized student support. AI tools, especially those capable of deep code analysis and contextual understanding, can detect common programming errors, logical inconsistencies, and stylistic shortcomings, thereby further improving the quality of feedback. This also opens the possibility of early identification of educational issues, such as conceptual difficulties or insufficient understanding of key programming constructs, which is crucial for timely intervention.

On the other hand, there are substantial risks associated with over-reliance on AI systems. Models may sometimes produce inaccurate or inadequate grades due to inherent limitations of the training dataset, as well as the complexity and creativity of programming tasks. Technical limitations include API capacities, restricted context windows in some models, and unpredictable performance in cases of more complex tasks. Algorithmic transparency and the way AI reaches its conclusions represent an additional challenge, particularly in academic settings where justification of grades and protection against bias are essential.

Based on the conducted experiment, the development of hybrid grading models is recommended, combining the strengths of AI systems with necessary expert oversight from instructors. Such an approach offers the best of both worlds—automation and speed along with human judgment and ethical responsibility. Furthermore, it is important to establish clear ethical guidelines for the use of AI in education, including data privacy, academic integrity, and transparency of grading.

Due to challenges associated with proprietary AI platforms, a stronger orientation toward open-source and interoperable systems is recommended, which allows greater control, customization, and integration within existing educational ecosystems. Further development should include the standardization of evaluation metrics for AI systems, API development for easier LMS integration, and continuous assessment of AI tools’ impact on educational outcomes and student motivation through empirical studies.

Ultimately, the integration of AI into the grading process represents a paradigmatic shift in educational practice, which, if properly guided, can significantly enhance the quality and efficiency of higher education.

These results align with recent studies exploring AI-assisted grading of programming assignments. Mehta et al. [9] found that ChatGPT could apply a grading rubric in an introductory programming course with high fidelity, and its point deductions were strongly correlated with those of human instructors. Similarly, Jukiewicz [10] reported a strong positive correlation between ChatGPT’s grades and an instructor’s grades across multiple coding assignments, with instructors only slightly more generous on average. In another study, Aytutuldu et al. [11] deployed a ChatGPT/Gemini-based grading system in computer science courses and observed that the AI models’ scores were strongly correlated both with each other and with human judgments. These parallel findings suggest that large language models can replicate human grading patterns for code assignments to a remarkable degree, reinforcing the credibility of our approach. Minor differences do exist—for instance, as Jukiewicz noted and our results hinted, the AI may sometimes grade slightly stricter than instructors—but overall the ranking of student performance by the AI was consistent with human assessment in both their study and ours.

Moreover, any AI-driven grading system must safeguard student data privacy, ensuring that code submissions and feedback remain secure and compliant with institutional policies. AI graders also have inherent limitations—they cannot truly evaluate the creativity or originality of a student’s solution, nor the rationale behind architectural design choices. As a result, an AI might overlook innovative approaches or fail to detect when a student’s work isn’t authentically their own. These factors, combined with potential algorithmic biases in LLMs, underscore the importance of human oversight and a balanced approach to AI use in grading.

6. Conclusions

The experiment conducted in this study demonstrated that modern artificial intelligence tools, particularly large language models such as ChatGPT-4, Gemini 1.5, and Claude 3, can significantly enhance the process of preliminary grading of student programming assignments. The greatest contribution is observed in the initial code evaluation phase, detection of syntactic and logical errors, and generation of structured comments that provide students with high-quality feedback.

ChatGPT-4, when used with the Data Analyst module, proved to be the most reliable in terms of grading consistency, quality of explanations, and detection of edge cases. Gemini 1.5 also offers high accuracy and exceptional processing speed, especially when accessed via the AI Studio platform. Claude 3, while promising, exhibited greater variability in results, necessitating additional validation in educational contexts.

In the context of teaching practice, AI systems are recommended for:

Rapid preliminary grading of large student groups,
Generating personalized feedback for simpler assignments,
Analyzing trends in student performance throughout the semester (e.g., for identifying candidates for partial exam exemptions),
Automating certain administrative processes, such as aggregation and classification of grades.

However, the use of AI systems cannot fully replace expert evaluation. Human oversight is essential to ensure accuracy, ethical compliance, and pedagogical quality of grading. Additionally, investment is required to build robust data processing infrastructure and to ensure alignment with privacy policies and academic integrity standards.

For future application, the following are recommended:

Integration of AI models into LMS platforms (e.g., Moodle) via plugins enabling local processing and audit trails of grades,
Development of scripts for preprocessing and validating student code before submission to AI models,
Training teaching staff in formulating optimal prompts and interpreting AI results,
Further research on the impact of AI grading on student motivation and learning outcomes through longitudinal studies and controlled experiments.

In conclusion, the results indicate that it is possible to develop reliable, efficient, and scalable AI-assisted grading systems in education, provided that careful design, expert oversight, and clear pedagogical methodology are ensured.

Author Contributions

Conceptualization, A.B. and D.R.; methodology, A.B. and D.R.; software, D.R.; validation, A.B., D.R. and A.Č.; formal analysis, A.B.; investigation, D.R.; resources, D.R.; writing—original draft preparation, D.R.; writing—review and editing, A.B.; visualization, A.B.; supervision, D.R.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All research data could be provided via email darados@foi.hr.

Conflicts of Interest

The authors have reviewed and edited the output and take full responsibility for the content of this publication. The authors declare no conflicts of interest.

References

Impey, C.; Wenger, M.; Garuda, N.; Golchin, S.; Stamer, S. Using Large Language Models for Automated Grading of Student Writing about Science. Int. J. Artif. Intell. Educ. 2025, 1–35. Available online: https://link.springer.com/article/10.1007/s40593-024-00453-7 (accessed on 13 September 2025). [CrossRef]
Usher, M. Generative AI vs. instructor vs. peer assessments: A comparison of grading and feedback in higher education. Assess. Eval. High. Educ. 2025, 50, 912–927. Available online: www.tandfonline.com/doi/full/10.1080/02602938.2025.2487495 (accessed on 13 September 2025). [CrossRef]
Grévisse, C. LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024, 24, 1060. Available online: https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-024-06026-5 (accessed on 13 September 2025). [CrossRef] [PubMed]
Quah, B.; Zheng, L.; Sng, T.J.H.; Yong, C.W.; Islam, I. Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations. BMC Med. Educ. 2024, 24, 962. Available online: https://link.springer.com/article/10.1186/s12909-024-05881-6 (accessed on 13 September 2025). [CrossRef]
Lundgren, M. Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders. arXiv 2024, arXiv:2406.16510. [Google Scholar] [CrossRef]
Mok, R.; Akhtar, F.; Clare, L.; Li, C.; Ida, J.; Ross, L.; Campanelli, M. Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics. arXiv 2024, arXiv:2411.13685. [Google Scholar] [CrossRef]
Xiao, C.; Ma, W.; Song, Q.; Xu, S.X.; Zhang, K.; Wang, Y.; Fu, Q. Human-ai collaborative essay scoring: A dual-process framework with llms. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 293–305. [Google Scholar] [CrossRef]
Tseng, E.Q.; Huang, P.C.; Hsu, C.; Wu, P.Y.; Ku, C.T.; Kang, Y. CodEv: An Automated Grading Framework Leveraging Large Language Models for Consistent and Constructive Feedback. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: New York, NY, USA, 2024; pp. 5442–5449. Available online: https://arxiv.org/abs/2501.10421 (accessed on 13 September 2025).
Mehta, A.; Gupta, N.; Balachandran, A.; Kumar, D.; Jalote, P. Can chatgpt play the role of a teaching assistant in an introductory programming course? arXiv 2023, arXiv:2312.07343. Available online: https://arxiv.org/abs/2312.07343 (accessed on 13 September 2025).
Jukiewicz, M. The future of grading programming assignments in education: The role of ChatGPT in automating the assessment and feedback process. Think. Ski. Creat. 2024, 52, 101522. Available online: https://www.sciencedirect.com/science/article/pii/S1871187124000609 (accessed on 13 September 2025). [CrossRef]
Aytutuldu, I.; Yol, O.; Akgul, Y.S. Integrating LLMs for Grading and Appeal Resolution in Computer Science Education. arXiv 2025, arXiv:2504.13557. [Google Scholar] [CrossRef]
Mangalur, A.; Hegde, K.; Badachi, C.; Aamir, M. Transforming Student Evaluation with Adaptive Intelligence and Performance Analytics. arXiv 2025, arXiv:2503.04752. Available online: https://arxiv.org/abs/2503.04752 (accessed on 13 September 2025).
Escalante, J.; Pack, A.; Barrett, A. AI-generated feedback on writing: Insights into efficacy and ENL student preference. Int. J. Educ. Technol. High. Educ. 2023, 20, 57. Available online: https://link.springer.com/article/10.1186/s41239-023-00425-2 (accessed on 13 September 2025). [CrossRef]
Banihashem, S.K.; Kerman, N.T.; Noroozi, O.; Moon, J.; Drachsler, H. Feedback sources in essay writing: Peer-generated or AI-generated feedback? Int. J. Educ. Technol. High. Educ. 2024, 21, 23. Available online: https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-024-00455-4 (accessed on 13 September 2025). [CrossRef]
Chaudhry, I.S.; Sarwary, S.A.M.; El Refae, G.A.; Chabchoub, H. Time to revisit existing student’s performance evaluation approach in higher education sector in a new era of ChatGPT—A case study. Cogent Educ. 2023, 10, 2210461. Available online: https://www.tandfonline.com/doi/full/10.1080/2331186X.2023.2210461#abstract (accessed on 13 September 2025). [CrossRef]
Kumar, R. Faculty members’ use of artificial intelligence to grade student papers: A case of implications. Int. J. Educ. Integr. 2023, 19, 9. [Google Scholar] [CrossRef]
Ilieva, G.; Yankova, T.; Ruseva, M.; Kabaivanov, S. A Framework for Generative AI-Driven Assessment in Higher Education. Information 2025, 16, 472. [Google Scholar] [CrossRef]
Elsayed, S.; Cakir, D. Implementation of assessment and feedback in higher education. Acta Pedagog. Asiana 2023, 2, 34–42. Available online: https://tecnoscientifica.com/journal/apga/article/view/170 (accessed on 13 September 2025). [CrossRef]
Smolansky, A.; Cram, A.; Raduescu, C.; Zeivots, S.; Huber, E.; Kizilcec, R.F. Educator and student perspectives on the impact of generative AI on assessments in higher education. In Proceedings of the Tenth ACM Conference on Learning@ Scale, Copenhagen, Denmark, 20–22 July 2023; pp. 378–382. [Google Scholar]
Čep, A.; Bernik, A. ChatGPT and Artificial Intelligence in Higher Education: Literature Review Powered by Artificial Intelligence. In Proceedings of the Science and Information Conference, London, UK, 26–27 June 2024; Springer: Cham, Switzerland, 2024; pp. 240–248. [Google Scholar]

Table 1. Evaluation Rubric.

Criterion	Description	Points
Functional correctness	correct execution according to specification	0–5
Code structure and readability	clarity of expressions, logical organization, and variable naming	0–3
Commenting and documentation	presence of meaningful comments and explanations	0–1
Efficiency and optimization	solution quality in terms of complexity and resource usage	0–1

Table 2. Evaluation Metrics.

Metric Type	Metrics
Quantitative	Avg. score difference, SD of scores between AI and instructor, percentage of accurate predictions (±1 point tolerance), processing time per 100 assignments
Qualitative	Quality of generated feedback, detection of errors overlooked by instructors, interpretability of model decisions

Table 3. The performance of each model.

Metric	ChatGPT-4	Gemini 1.5	Claude 3
Pearson correlation with instructor	0.91	0.88	0.85
Average absolute score difference (0–10)	0.68 points	0.79 points	0.94 points
Grades within ±1 point of instructor	86%	81%	75%
Grading time per 100 assignments	~11 min	~8 min	~90 min *

* Claude 3 was used in an interactive (manual) mode, hence slower processing.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bernik, A.; Radošević, D.; Čep, A. A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading. Appl. Sci. 2025, 15, 10055. https://doi.org/10.3390/app151810055

AMA Style

Bernik A, Radošević D, Čep A. A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading. Applied Sciences. 2025; 15(18):10055. https://doi.org/10.3390/app151810055

Chicago/Turabian Style

Bernik, Andrija, Danijel Radošević, and Andrej Čep. 2025. "A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading" Applied Sciences 15, no. 18: 10055. https://doi.org/10.3390/app151810055

APA Style

Bernik, A., Radošević, D., & Čep, A. (2025). A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading. Applied Sciences, 15(18), 10055. https://doi.org/10.3390/app151810055

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

Abstract

1. Introduction

2. Related Work and Review of Existing AI Tools in Education

2.1. General Web-Based AI Tools

2.2. Limitations of Web-Based Tools

3. Methodology

3.1. Data Collection and Processing

3.2. Definition of Evaluation Criteria

3.3. AI-Based Evaluation

3.4. Result Analysis and Validation

4. Case Study: Preliminary Grading of Student Assignments

4.1. Quantitative Analysis of Grades

4.2. Quality of Feedback

5. Discussion and Recommendations

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI